Discussion:
ZFS on Large SSD Arrays
Doug Dumitru
2013-10-29 19:51:41 UTC
I have been doing some testing on large SSD arrays. The configuration is:

Xeon E5-1650
24 128GB "consumer" SSD
64GB RAM

This system is fast. At q=200 the drives will do 4K reads at > 800,000
IOPS. Doing linear writes, a raid-0 array can hit 6GB/sec.

I am testing this with a 300GB target ZVOL using a test program that does
O_DIRECT aligned random block IO. Compression and de-dupe are both turned
off.
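
For anyone trying to reproduce this, the setup amounts to roughly the following (the zvol name here is a placeholder; the actual pool creation command is quoted verbatim later in this thread):

zpool create data /dev/sd[b-y] -f
zfs create -V 300G -o compression=off -o dedup=off data/testvol
# the benchmark then issues O_DIRECT, aligned, random 4K I/O against /dev/zvol/data/testvol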

In general, reads seem to saturate at just under 300,000 IOPS at q=200.
This is not that surprising and can be attributed to SHA overhead.

My concern is writes. The first issue is that write IOPS are "pedestrian" for this
size of array at under 80,000 IOPS. 80K IOPS sounds like a lot, but the
array is capable of a lot more than this. This 80K is also quite variable
and is often a lot lower, depending on the pre-conditioning of the target
volume. On one particular test at q=10, random writes proceeded at 17,676
IOPS or 69.06 MB/sec. This was a "100% default" zpool running raid-0.
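
(As a sanity check on those two numbers: 17,676 IOPS x 4 KiB ≈ 69 MiB/sec, so they are consistent with each other.)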

My bigger concern is what is happening to the drives underneath. During
this test above, I watched the underlying devices with 'iostat' and they
were doing 365.27MB/sec of "actual writes" at the drive level. This is a
"wear amplification" of more than 5:1. For SSDs wear amplification is
important because it directly leads to flash wear out.

Just for fun, I re-ran the above tests with the zpool configured as
raidz3. With triple-parity raid, the wear amplification jumped to 23:1.

This testing implies that ZFS is just not designed for pure SSD array
usage, at least with large arrays. This array is 3TB, but I have customers
running pure SSD arrays as large as 48TB, and the trend implies that PB SSD
arrays are just around the corner. I suspect there is some tuning that can
help (setting the block size lower seems to help some), but I would like to
understand more of what is going on before I jump to extreme conclusions
(although 23:1 wear amplification is pretty extreme). Eventually, my
testing will become a paper, so if I am off base, I would like to not
embarrass myself.

Comments on tuning for pure SSD arrays would be very welcome. I have seen
some posts that imply the same issues have been seen elsewhere. For
example, someone tried a pair of Fusion-IO cards and was stuck at 400MB/sec
on writes. This setup is larger/faster than a pair of Fusion cards.
Again, any help would be appreciated.

To show the "write amplification", here is a 20 second snapshot from
'iostat -x 20 -m':

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz
avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 1172.30 0.00 57.27
100.04 0.19 0.16 0.00 0.16 0.13 14.76
sdd 0.00 0.00 0.00 1162.20 0.00 57.27
100.91 0.20 0.17 0.00 0.17 0.13 15.44
sde 0.00 0.00 3486.90 1159.95 27.24 57.27
37.25 1.04 0.22 0.23 0.20 0.13 62.50
sdg 0.00 0.00 0.00 1167.25 0.00 57.26
100.46 0.20 0.17 0.00 0.17 0.14 15.86
sdf 0.00 0.00 0.00 1170.75 0.00 57.26
100.16 0.19 0.16 0.00 0.16 0.13 15.56
sdi 0.00 0.00 3482.65 1155.95 27.21 57.26
37.29 1.05 0.23 0.24 0.20 0.13 62.00
sdc 0.00 0.00 0.00 1174.15 0.00 57.27
99.89 0.19 0.16 0.00 0.16 0.13 15.02
sdh 0.00 0.00 0.00 1170.00 0.00 57.26
100.22 0.20 0.17 0.00 0.17 0.13 15.62
sdj 0.00 0.00 0.00 1144.60 0.00 57.24
102.41 0.31 0.27 0.00 0.27 0.15 17.46
sdl 0.00 0.00 0.00 1136.80 0.00 57.24
103.11 0.44 0.38 0.00 0.38 0.16 18.40
sdp 0.00 0.00 0.00 1140.95 0.00 57.25
102.77 0.47 0.41 0.00 0.41 0.17 18.96
sdq 0.00 0.00 3488.40 1091.05 27.25 57.25
37.79 1.32 0.29 0.30 0.25 0.14 62.76
sdk 0.00 0.00 0.00 1138.95 0.00 57.24
102.92 0.39 0.35 0.00 0.35 0.16 18.74
sdm 0.00 0.00 3488.50 968.00 27.25 57.24
38.83 1.32 0.30 0.30 0.28 0.14 62.98
sdn 0.00 0.00 0.00 1150.80 0.00 57.25
101.89 0.53 0.46 0.00 0.46 0.17 19.12
sdo 0.00 0.00 0.00 1138.25 0.00 57.25
103.02 0.44 0.38 0.00 0.38 0.16 18.04
sdr 0.00 0.00 0.00 1168.45 0.00 57.30
100.43 0.19 0.17 0.00 0.17 0.12 14.58
sds 0.00 0.00 0.00 1170.05 0.00 57.30
100.29 0.21 0.18 0.00 0.18 0.14 15.98
sdt 0.00 0.00 0.00 949.10 0.00 57.30
123.64 0.23 0.25 0.00 0.25 0.16 15.40
sdv 0.00 0.00 0.00 1163.45 0.00 57.23
100.74 0.21 0.18 0.00 0.18 0.13 15.62
sdw 0.00 0.00 0.00 1168.20 0.00 57.23
100.33 0.21 0.18 0.00 0.18 0.14 16.00
sdx 0.00 0.00 0.00 1155.80 0.00 57.23
101.41 0.20 0.18 0.00 0.18 0.14 16.18
sdy 0.00 0.00 3488.30 1157.35 27.25 57.23
37.24 1.03 0.22 0.23 0.20 0.13 62.12
sdu 0.00 0.00 3491.60 1086.10 27.28 57.30
37.84 1.04 0.23 0.23 0.21 0.13 61.28
zd0 0.00 0.00 0.00 46480.65 0.00 181.57
8.00 9.69 0.21 0.00 0.21 0.02 100.00

sda and the dm devices are the boot drive and boot volume encryption. They
are not part of this test.

You can see here that 'zd0' is doing 8-sector (4K) writes at 181MB/sec.
The 24 underlying disks (sdb - sdy) are writing at 57.25 MB/sec each,
totaling 1374 MB/sec, for a wear amplification of 7.6:1. This particular
test was done with volblocksize=4k on raidz3. Setting the blocksize down does
seem to help, but I suspect it hurts a lot of other "stuff". Some of the
disks were also doing a lot of "reads", which is depressing for a system
with 64GB of RAM running a single 250GB test with q=10.
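
For what it is worth, the ratio is easy to compute from a single saved report like the one above (a rough sketch; it assumes the pool disks are sdb..sdy, the zvol is zd0, and that one 20-second report from 'iostat -x 20 -m' has been saved to iostat-report.txt):

# wMB/s is the 7th column of 'iostat -x -m' output
awk '
    $1 ~ /^sd[b-y]$/ { pool += $7 }   # writes hitting the 24 pool SSDs
    $1 == "zd0"      { zvol  = $7 }   # writes issued to the zvol
    END { if (zvol) printf "wear amplification ~ %.1f:1\n", pool/zvol }
' iostat-report.txt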

These tests are for pure SSD (front-line storage). I do not have an L2ARC
or ZIL drive defined.

You can contact me on this group or off-line at:

"first name" at my company name (less the LLC) .com

Doug Dumitru
CTO EasyCo LLC

Niels de Carpentier
2013-10-29 20:44:08 UTC
> I have been doing some testing on large SSD arrays. The configuration is:
>
> Xeon E5-1650
> 24 128GB "consumer" SSD
> 64GB RAM
>
> This system is fast. At q=200 the drives will do 4K reads at > 800,000
> IOPS. Doing linear writes, a raid-0 array can hit 6GB/sec.
>
> I am testing this with a 300GB target ZVOL using a test program that does
> O_DIRECT aligned random block IO. Compression and de-dupe are both turned
> off.
>
> In general, reads seem to saturate out just under 300,000 IOPS at q=200.
> This is not that surprising and can be attributed to SHA overhead.
>
> My concern is writes. One is that write IOPS are "pedestrian" for this
> size of array at under 80,000 IOPS. 80K IOPS sounds like a lot, but the
> array is capable of a lot more than this. This 80K also is quite variable
> and is often a lot lower, depending on the pre-conditioning of the target
> volume. On a particular test at q=10, random writes proceeded at 17,676
> IOPS or 69.06MB/sec. This was a "100% default" zpool running raid-0.
>
> My bigger concern is what is happening to the drives underneath. During
> this test above, I watched the underlying devices with 'iostat' and they
> were doing 365.27MB/sec of "actual writes" at the drive level. This is a
> "wear amplification" of more than 5:1. For SSDs wear amplification is
> important because it directly leads to flash wear out.

Likely the ashift was automatically set to 13 (8K), which causes each 4K
block to be written as 8K. Make sure to specify ashift=12 when creating
the pool. Also, sync writes will by default first be written to the ZIL
(the normal array is used if no separate log device is specified), so that's
a doubling of the writes as well. Set the zvol logbias property to
throughput to disable this.
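
Something along these lines (pool and zvol names are placeholders; the zdb line is just one way to check which ashift you actually ended up with):

zpool create -f -o ashift=12 tank /dev/sd[b-y]   # force 4K sectors at creation time
zdb -C tank | grep ashift                        # verify the ashift the vdevs actually got
zfs set logbias=throughput tank/testvol          # stop routing these writes through the log
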
>
> Just for fun, i re-ran the above tests with the zpool configured at
> raidz3. With triple parity raid, the wear amplification jumped to 23:1.

Yes, with 4K blocks this is essentially a 4 way mirror, and so will write
4 times the amount of data. You can use striped mirrors for redundancy and
better performance.

>
> This testing implies that ZFS is just not designed for pure SSD array
> usage, at least with large arrays. This array is 3TB, but I have
> customers
> running pure SSD array as large as 48TB and the trend implies that PB SSD
> arrays are just around the corner. I suspect there is some tuning that
> can
> help (setting the block size lower seems to help some), but I would like
> to
> understand more of what is going on before I jump to extreme conclusions
> (although 23:1 wear amplification is pretty extreme). Eventually, my
> testing will become a paper, so if I am off base, I would like to not
> embarrass myself.
>
> Comments on tuning for pure SSD arrays would be very welcome. I have seen
> some posts that imply the same issues have been seen elsewhere. For
> example, someone tried a pair of Fusion-IO cards and was stuck at
> 400MB/sec
> on writes. This setup is larger/faster than a pair of Fusion cards.
> Again, any help would be appreciated.

Increasing zfs_vdev_max_pending fixed the issue for the Fusion-IO cards,
but I'm not sure if that can be set for zvols.

Niels



Doug Dumitru
2013-10-29 21:03:28 UTC
On Tuesday, October 29, 2013 1:44:08 PM UTC-7, Niels de Carpentier wrote:
>
> > I have been doing some testing on large SSD arrays. The configuration
> is:
>

[ ... snipped ... ]

> > My bigger concern is what is happening to the drives underneath. During
> > this test above, I watched the underlying devices with 'iostat' and they
> > were doing 365.27MB/sec of "actual writes" at the drive level. This is
> a
> > "wear amplification" of more than 5:1. For SSDs wear amplification is
> > important because it directly leads to flash wear out.
>
> Likely the ashift was automatically set to 13 (8K), which causes each 4K
> block to be written as 8K. Make sure to specify ashift=12 when creating
> the pool. Also sync writes will by default first be written to the ZIL
> (the normal array is used if no ZIL is specified), so that's a doubling of
> the writes as well. Set the zvol logbiad property to throughput to disable
> this.
>

ashift is at 12 (4k). I re-ran the test with logbias set to throughput for
the zvol, and it had no impact.


> >
> > Just for fun, i re-ran the above tests with the zpool configured at
> > raidz3. With triple parity raid, the wear amplification jumped to 23:1.
>
> Yes, with 4K blocks this is essentially a 4 way mirror, and so will write
> 4 times the amount of data. You can use striped mirrors for redundancy and
> better performance.
>
I expect 4x redundancy, but not 4x space usage. This is supposed to be
"parity" so it should be data+3.

>
>
[ ... snipped ... ]

> Comments on tuning for pure SSD arrays would be very welcome. I have
> seen
> > some posts that imply the same issues have been seen elsewhere. For
> > example, someone tried a pair of Fusion-IO cards and was stuck at
> > 400MB/sec
> > on writes. This setup is larger/faster than a pair of Fusion cards.
> > Again, any help would be appreciated.
>
> Increasing zfs_vdev_max_pending fixed the issue for the Fusion-IO cards,
> but I'm not sure if that can be set for zvol's.
>

I set this to 64 (from 10) in /sys/module/zfs/parameters and it "seemed" to
accept the new value. No change in write performance or underlying writes.
Oh well.
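
For reference, the change amounts to this (path assumes the stock ZoL module; the modprobe.d line would make it persistent across module reloads):

echo 64 > /sys/module/zfs/parameters/zfs_vdev_max_pending                # runtime change
echo "options zfs zfs_vdev_max_pending=64" >> /etc/modprobe.d/zfs.conf   # persistent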


> Niels
>

Again, thank you for the reply. I am really trying to understand the
issues here.

Doug Dumitru
EasyCo LLC

aurfalien
2013-10-29 21:13:01 UTC
On Oct 29, 2013, at 2:03 PM, Doug Dumitru wrote:

> [ ... snipped ... ]

Curious, not to dilute the thread, but could it be a limit of whatever interface you are using for the SSDs?

I've also noticed a 30% difference in my tests (fio, dd, iozone) simply from manipulating power settings in the BIOS and in the OS (whatever distro you are using).

- aurf

Doug Dumitru
2013-10-29 22:04:52 UTC
[ ... snipped ... ]

> Curious, not to dilute the thread but could it be a limit of whatever
> interface you are using for the SSDs?



> I've also noticed a 30% diff in my tests; fio, dd, iozone with simply
> manipulating power settings in BIOS and the system (whatever distro you are
> using).
>
> - aurf
>

The system is set up pretty carefully. CPUs are set to the performance
governor. The controllers are LSI 8-port SAS HBAs with IRQ affinity spread
around.

In general, the drives are not at all saturated (see the iostat dump in the
previous post). The server is running Ubuntu 12.04 with ZFS 0.6.2 from a
stock install.

Doug Dumitru
EasyCo LLC


Niels de Carpentier
2013-10-29 21:45:43 UTC
>
> ashift is at 12 (4k). I re-ran the test with logbias set as throughput
> for
> the zvol and it had no impact.

Weird. Another thing you can try is to set zil_slog_limit very small (I
don't know if 0 is allowed). That should prevent double writing of the
data.
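
It is an ordinary module parameter, so something like this should do it (the value is in bytes; whether 0 is accepted I have not checked):

echo 1 > /sys/module/zfs/parameters/zil_slog_limit
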
>

>> I expect 4x redundancy, but not 4x space usage. This is supposed to be
> "parity" so it should be data+3.

If you write a 4kB block, it also needs to write 3 parity blocks of 4kB
(ashift = 12 means a minimum blocksize of 4kB). If you set ashift=9, the
overhead will be lower. ( 8 512B data blocks and 3 512B parity blocks). A
large raidz vdev with a high ashift and small block size is not a good
combination.
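
Roughly, ignoring metadata and assuming one parity set per record:

4K record, raidz3, ashift=12:  1 x 4K data   + 3 x 4K parity   = 16K on disk   (4.0x)
4K record, raidz3, ashift=9:   8 x 512B data + 3 x 512B parity = 5.5K on disk  (~1.4x)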

>
> I set this to 64 (from10) in /sys/modules/zfs/parameters at it "seemed" to
> accept a new value. No change in write performance or underlying writes.
> Oh well.

What is the CPU load during the test?

Niels

Marcus Sorensen
2013-10-29 22:04:47 UTC
On Tue, Oct 29, 2013 at 3:45 PM, Niels de Carpentier
<zfs-zx3GLP/***@public.gmane.org> wrote:
>>
>> ashift is at 12 (4k). I re-ran the test with logbias set as throughput
>> for
>> the zvol and it had no impact.
>
> Weird. Another thing you can try is to set zil_slog_limit very small (I
> don't know if 0 is allowed). That should prevent double writing of the
> data.
>>
>
>>> I expect 4x redundancy, but not 4x space usage. This is supposed to be
>> "parity" so it should be data+3.
>
> If you write a 4kB block, it also needs to write 3 parity blocks of 4kB
> (ashift = 12 means a minimum blocksize of 4kB). If you set ashift=9, the
> overhead will be lower. ( 8 512B data blocks and 3 512B parity blocks). A
> large raidz vdev with a high ashift and small block size is not a good
> combination.

That's where my mind went as well. It reminded me of this thread:

http://osdir.com/ml/zfs-discuss/2013-09/msg00433.html

You can do some zdb digging and probably find that you're burning half
of your disk space in parity. It writes parity *per record* as a
minimum, so if you have a 4k vol block size and 4k ashift, you're going
to write a 4k parity block for each 4k vol block (or 3 parity blocks
with raidz3). If you reduce ashift, you'll write that 4k out in 512b
chunks and your parity blocks can be smaller. If you increase
volblocksize it will work in a similar fashion.
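
A crude way to see it without going very deep into zdb is to compare the pool-level and dataset-level space accounting (a sketch; the pool name is the one from earlier in the thread and the numbers will obviously vary):

zpool list data                          # ALLOC counts raw blocks, parity included
zfs list -o name,used,referenced data    # space as charged to the datasets/zvol
# on raidz3 with a 4k volblocksize, ALLOC can run close to 4x the zvol's USED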

>
>>
>> I set this to 64 (from10) in /sys/modules/zfs/parameters at it "seemed" to
>> accept a new value. No change in write performance or underlying writes.
>> Oh well.
>
> What is the CPU load during the test?
>
> Niels
>
Doug Dumitru
2013-10-29 22:12:26 UTC
On Tuesday, October 29, 2013 2:45:43 PM UTC-7, Niels de Carpentier wrote:
>
> >
> > ashift is at 12 (4k). I re-ran the test with logbias set as throughput
> > for
> > the zvol and it had no impact.
>
> Weird. Another thing you can try is to set zil_slog_limit very small (I
> don't know if 0 is allowed). That should prevent double writing of the
> data.
>

The parameter was accepted in /sys/.../parameters but the test remained the
same.


> >
>
> >> I expect 4x redundancy, but not 4x space usage. This is supposed to be
> > "parity" so it should be data+3.
>
> If you write a 4kB block, it also needs to write 3 parity blocks of 4kB
> (ashift = 12 means a minimum blocksize of 4kB). If you set ashift=9, the
> overhead will be lower. ( 8 512B data blocks and 3 512B parity blocks). A
> large raidz vdev with a high ashift and small block size is not a good
> combination.
>
> >
> > I set this to 64 (from10) in /sys/modules/zfs/parameters at it "seemed"
> to
> > accept a new value. No change in write performance or underlying
> writes.
> > Oh well.
>
> What is the CPU load during the test?
>
Pretty heavy, but no single thread is 100% saturated. Here is a 10-second
'htop' snapshot:


  CPU 0: 68.6%    CPU 6:  46.9%
  CPU 1: 64.8%    CPU 7:  48.6%
  CPU 2: 67.7%    CPU 8:  43.3%
  CPU 3: 65.3%    CPU 9:  43.6%
  CPU 4: 57.4%    CPU 10: 48.6%
  CPU 5: 53.2%    CPU 11: 48.7%
  Mem: 56710/64401MB    Tasks: 44, 4 thr, 269 kthr; 16 running
  Swp: 0/647MB          Load average: 14.51 3.33 1.14
  Uptime: 21:12:32

PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
16578 root 0 -20 0 0 0 D 34.0 0.0 3:05.83 txg_sync
16474 root 39 19 0 0 0 S 21.0 0.0 2:22.25 z_wr_iss/4
16481 root 39 19 0 0 0 S 20.0 0.0 2:32.68 z_wr_iss/11
16477 root 39 19 0 0 0 S 19.0 0.0 2:30.06 z_wr_iss/7
16480 root 39 19 0 0 0 S 18.0 0.0 2:35.65 z_wr_iss/10
16478 root 39 19 0 0 0 S 17.0 0.0 2:30.42 z_wr_iss/8
16479 root 39 19 0 0 0 S 15.0 0.0 2:33.96 z_wr_iss/9
16496 root 39 19 0 0 0 D 15.0 0.0 1:46.70 z_wr_int/9
16497 root 39 19 0 0 0 D 15.0 0.0 1:46.01 z_wr_int/10
16500 root 39 19 0 0 0 R 15.0 0.0 1:22.57 z_wr_int/13
16472 root 39 19 0 0 0 S 14.0 0.0 1:58.21 z_wr_iss/2
16475 root 39 19 0 0 0 S 14.0 0.0 2:21.05 z_wr_iss/5
16490 root 39 19 0 0 0 R 14.0 0.0 1:21.04 z_wr_int/3
16493 root 39 19 0 0 0 R 14.0 0.0 1:47.74 z_wr_int/6
16495 root 39 19 0 0 0 D 14.0 0.0 1:52.40 z_wr_int/8
16498 root 39 19 0 0 0 D 14.0 0.0 1:43.26 z_wr_int/11
16476 root 39 19 0 0 0 S 13.0 0.0 2:32.47 z_wr_iss/6
16488 root 39 19 0 0 0 D 13.0 0.0 1:22.40 z_wr_int/1
16492 root 39 19 0 0 0 D 13.0 0.0 1:40.57 z_wr_int/5
16494 root 39 19 0 0 0 R 13.0 0.0 1:50.35 z_wr_int/7
16471 root 39 19 0 0 0 S 12.0 0.0 1:57.12 z_wr_iss/1
16473 root 39 19 0 0 0 S 12.0 0.0 2:00.79 z_wr_iss/3
16489 root 39 19 0 0 0 D 12.0 0.0 1:22.78 z_wr_int/2
16499 root 39 19 0 0 0 R 12.0 0.0 1:14.59 z_wr_int/12
16501 root 39 19 0 0 0 D 12.0 0.0 1:23.20 z_wr_int/14
16502 root 39 19 0 0 0 R 12.0 0.0 1:21.28 z_wr_int/15
16487 root 39 19 0 0 0 D 11.0 0.0 1:15.60 z_wr_int/0
16470 root 39 19 0 0 0 S 10.0 0.0 1:58.60 z_wr_iss/0
16491 root 39 19 0 0 0 R 10.0 0.0 1:38.90 z_wr_int/4
13920 root 39 19 0 0 0 S 6.0 0.0 2:30.65 spl_kmem_cache/
13942 root 39 19 0 0 0 D 5.0 0.0 0:28.65 zvol/6
13943 root 39 19 0 0 0 S 5.0 0.0 0:28.46 zvol/7
13944 root 39 19 0 0 0 D 5.0 0.0 0:30.19 zvol/8
13945 root 39 19 0 0 0 D 5.0 0.0 0:30.01 zvol/9
13947 root 39 19 0 0 0 S 5.0 0.0 0:29.42 zvol/11
13956 root 39 19 0 0 0 S 5.0 0.0 0:30.21 zvol/20
13957 root 39 19 0 0 0 S 5.0 0.0 0:29.58 zvol/21

The calling benchmark program is not even in the list.

Doug Dumitru

Daniel Brooks
2013-10-29 21:36:21 UTC
What do you mean by 'raid-0' when you're talking about ZFS? Do you really
mean 24 top-level vdevs each consisting of 1 drive?

Regardless, write amplification is to be expected; remember that each block
in your vdev is a leaf in a Merkle tree, and that each change to a block
requires writing new versions of all of the ancestor blocks as well.


On Tue, Oct 29, 2013 at 12:51 PM, Doug Dumitru <
dougdumitruredirect-***@public.gmane.org> wrote:

> [ ... snipped ... ]

Doug Dumitru
2013-10-29 22:07:50 UTC
On Tuesday, October 29, 2013 2:36:21 PM UTC-7, db48x wrote:
>
> What do you mean by 'raid-0' when you're talking about ZFS? Do you really
> mean 24 top-level vdevs each consisting of 1 drive?
>
> Regardless, write amplification is to be expected; remember that each
> block in your vdev is a leaf in a Merkle tree, and that each change to a
> block requires writing new versions of all of the ancestor blocks as well.
>
The setup is just to create a zpool with 24 backing drives, i.e.:

zpool create data /dev/sd[b-y] -f

no partitions, just bare, recently secure erased SSDs.

If the tree branch is being re-written, how deep does this go? It looks
like past write patterns impact this (which makes sense). This looks a lot
like guessing at an FTL from the outside in.

Doug Dumitru
EasyCo LLC

Daniel Brooks
2013-10-30 01:30:30 UTC
Ouch, so you really were using a raid-0 style pool, with no redundancy at
all (ZFS will be able to detect errors but not correct them). You probably
want mirror pairs, like so: zpool create data mirror sdb sdc mirror sdd sde
mirror sdf sdg... etc. You could also go with what you have, but set the
copies property to 2 (or more) on your datasets. This will make ZFS store
extra copies of all the data, allowing it to correct errors. It wouldn't be
quite as fast as mirrored pairs though.
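
Spelled out (drive names reused from earlier in the thread; note that copies is a dataset-level property, applied with zfs set):

# 12 mirrored pairs striped together:
zpool create -f data mirror sdb sdc mirror sdd sde mirror sdf sdg \
    mirror sdh sdi mirror sdj sdk mirror sdl sdm mirror sdn sdo \
    mirror sdp sdq mirror sdr sds mirror sdt sdu mirror sdv sdw mirror sdx sdy

# or keep the current striped layout and store redundant copies instead:
zfs set copies=2 data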

So the data in a ZFS pool is a Merkle tree, which is an elegant way of
maintaining provable data integrity. Each block written to disk is hashed,
and the hash value is put into the parent block. The parent block is also
hashed, and its hash value is put in its parent block, and so on. This
means that if the content of a block changes, its hash changes and
therefore the content of the parent block must also change. Changes
therefore propagate recursively up the tree to the uberblock, which
contains a single hash value that effectively represents the entire pool
with all of the filesystems and zvols it contains. Because ZFS is a
copy-on-write filesystem, these blocks are not modified in place; instead
new ones are created and written to unused portions of the disk. This has
the benefit of serializing what would otherwise be a huge set of random
writes into a single contiguous write, so while you don't pay an IOPS
penalty you do end up writing more than 4k of data for every 4k zvol block
(redundancy acts on top of that). As usual, the designers of ZFS have opted
for safety over efficiency.

If you're particularly concerned about write amplification (and it's fair
that you would be), then you could play with the nesting depth of your
zvols. For example, if you put a zvol in as data/foo/bar/baz/vol1, then
foo, bar, and baz are all zfs filesystems with their own nested nodes in
the tree. This will be marginally more expensive than a zvol located at
data/vol2.


On Tue, Oct 29, 2013 at 3:07 PM, Doug Dumitru <dougdumitruredirect-***@public.gmane.org
> wrote:

> [ ... snipped ... ]

s***@public.gmane.org
2013-10-30 21:13:31 UTC
At what level is the recursion within the Merkle tree done?

I mean, is it only on datasets/zvols, or does each directory in a dataset have its
own hash covering all the hashes within the directory?

If I have only one dataset within the pool, does this mean:
1. that there are only hashes of blocks, plus one hash for the dataset and one
for the pool?
or
2. that each directory in the dataset has its own hash, so when I have a file
that is nested under 1000 directories, the hash needs to be recalculated 1002
times?


Is it possible to turn this off, if I don't want such super-safe storage? So
there would be only hashes for each block.

Thank you



On Wednesday, October 30, 2013 2:30:30 AM UTC+1, db48x wrote:
>
> [ ... snipped ... ]

aurfalien
2013-10-30 21:24:53 UTC
On Oct 30, 2013, at 2:13 PM, szabo2424-***@public.gmane.org wrote:

> On what level is the recursion within the Merkle tree done?
>
> I mean is it only on datasets/zvols or each direcotory in dataset has its own hash for all hashes within the directory.
>
> If I have only one dataset within the pool, does this means:
> 1. that there are only hashes of blocks and one hash for dataset and one for the pool?
> or
> 2. each directory in the dataset has its own hash, so when I have a file that is under 1000 directories, the hash needs to be recalculated 1002 times?
>
>
> Is it possible to turn this off, If I dont want so super safe storage? So there will be only hashes for each block.

Is that the same as disabling checksums?

Since you have the Merkle tree, I suppose, why checksum as well?

- aurf

>
> [ ... snipped ... ]

Daniel Brooks
2013-10-30 23:12:51 UTC
Each directory is an entry in the Merkle tree; if you had to hash the
entire filesystem on every write it would be very slow. In fact, I wouldn't
be surprised if big directories were stored as balanced trees instead of
flat lists, to reduce access and update times. You'll have to find some
technical documentation or play around with zdb and see.
On Oct 30, 2013 2:13 PM, <szabo2424-***@public.gmane.org> wrote:

> [ ... snipped ... ]

Daniel Brooks
2013-10-30 23:15:27 UTC
And no, it's not possible to turn it off; Merkle trees are the fundamental
organizing principle of ZFS. If you don't care about the safety of your
data, then use a different filesystem.
On Oct 30, 2013 2:13 PM, <szabo2424-***@public.gmane.org> wrote:

> [ ... snipped ... ]

Niels de Carpentier
2013-10-31 10:44:34 UTC
>
>
> On Tuesday, October 29, 2013 2:36:21 PM UTC-7, db48x wrote:
>>
>> What do you mean by 'raid-0' when you're talking about ZFS? Do you
>> really
>> mean 24 top-level vdevs each consisting of 1 drive?
>>
>> Regardless, write amplification is to be expected; remember that each
>> block in your vdev is a leaf in a Merkle tree, and that each change to a
>> block requires writing new versions of all of the ancestor blocks as
>> well.
>>
>> The setup is just create a zpool with 24 backing drives, ie:
>
> zpool create data /dev/sd[b-y] -f

I'm missing -o ashift=12 here. Did you use that when testing?

Niels



Doug Dumitru
2013-10-31 17:30:16 UTC
On Thursday, October 31, 2013 3:44:34 AM UTC-7, Niels de Carpentier wrote:
>
> >
> >
> > On Tuesday, October 29, 2013 2:36:21 PM UTC-7, db48x wrote:
> >>
> >> What do you mean by 'raid-0' when you're talking about ZFS? Do you
> >> really
> >> mean 24 top-level vdevs each consisting of 1 drive?
> >>
> >> Regardless, write amplification is to be expected; remember that each
> >> block in your vdev is a leaf in a Merkle tree, and that each change to
> a
> >> block requires writing new versions of all of the ancestor blocks as
> >> well.
> >>
> >> The setup is just create a zpool with 24 backing drives, ie:
> >
> > zpool create data /dev/sd[b-y] -f
>
> I'm missing -o shift=12 here. Did you use that when testing?
>

I have tried this with ashift=9 and ashift=12. Modern SSDs tend to be 4K
(and sometimes 8K internally), but the ZFS write streams are not as bad as
a true random workload.

ashift=9 seems to amplify a little less than ashift=12, but raid-z
performance drops off a lot more. 4K random writes into a zvol are just
under 80K IOPS but drop below 30K IOPS with raid-z1 and ashift=9. The
wear amp goes up, but there is more "state" going on here.

My tests are very dependent on the "state of the volume". Sometimes I get
"hot drives" where a few drives are getting a lot more writes than the rest
of the group. This only happens with raid and is most common with
raid-z3. The scary thing about this is that with flash, it is the last
thing you want to see.

Again, back to "state", the order of tests impacts wear. I suspect this is
not a drive conditioning issue, as the drives are never very busy. With 24
drives, I have never seen the drive %util above 30% except on pure random
reads. I have seen wear amp as high as 23:1 (which is really bad) and as
low as 1.7:1 for 4K random writes. 64K random writes are much better
(which would be expected).

My intent here is perhaps not 100% pure. My company has commercial
software specifically designed for SSDs. Nowhere near the feature set of
ZFS (not even trying), but some of the issues like data integrity are
addressed (although with a completely different method). We focus on wear
and 4K random writes and can reach 1M IOPS with de-dupe turned on with this
array. But my intention is not to produce an advert. We just see lots of
ZFS "talk" and wanted to see how well it maps to SSDs. It looks like SSDs
were not the design target and this has led to some trade-offs. This is
not bad, just how it worked out. I am somewhat surprised that in all my
"google searching" I have never seen any mention of wear amplification in
discussions. It seems that the design implies that it has to exist, but
that no one has ever measured it. The trend in flash is to go to lower
and lower geometries which in turn leads to lower and lower endurance.
Thus the concern with wear. Everything in ZFS seems to point to flash on
the L2ARC and ZIL device but with HDDs still running the main store. If
ZFS needs high endurance media because of wear amplification, this hybrid
approach may remain the norm.

Again, too much advert. If you want to discuss what I am doing more,
please contact me off-group. If you have any suggestions for my testing
with ZFS, discussions here are appropriate.

Doug Dumitru
EasyCo LLC

>
> Niels
>
>
>
>

Andreas Dilger
2013-10-31 17:41:50 UTC
I think a major part of the problem here may be that ZFS will normally use a 128kB blocksize for large files. I know it is possible to tune this for a ZVOL, but is it possible to tune it for a regular file? If you have a 4kB random write workload then 128kB blocks will be a killer.

I'm not sure if there is a way to tell what the chosen blocksize is for an individual file. Is that reported on a per-file basis by stat() on the file in st_blksize?
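
For concreteness, something like this is what I have in mind (dataset and file names are made up; %o in GNU stat prints st_blksize):

zfs set recordsize=4k tank/db               # recordsize is a per-dataset property, inherited by children
stat -c '%n blksize=%o' /tank/db/datafile   # what stat() reports as the I/O block size for a file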

Cheers, Andreas

On 2013-10-31, at 11:30, Doug Dumitru <dougdumitruredirect-***@public.gmane.org> wrote:

> [ ... snipped ... ]

s***@public.gmane.org
2013-11-01 14:37:52 UTC
Permalink
Are you sure it's because of that?

I want to start using ZFS on pure SSDs (the price per IOPS is way better
than with HDDs, even for enterprise SSDs), but this write amplification is a
deal breaker for me.

Should I do some tuning if I want to use pure SSDs? How much is the write
amplification under workloads like a web server or a MySQL server?

ZFS uses a 128kB blocksize for datasets by default? Should I set it to 4k
for a database server?
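
Something like this is what I have in mind (a sketch; pool/dataset names are made up, and as I understand it recordsize only affects data written after it is set):

zfs create -o recordsize=16K tank/db    # e.g. match the InnoDB page size
zfs set recordsize=4K tank/db           # or go all the way down to 4k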


Thank you.

On Thursday, October 31, 2013 6:41:50 PM UTC+1, LustreOne wrote:
>
> I think a major part of the problem here may that ZFS will normally use
> 128kB blocksize for large files. I know it is possible to tune this for a
> ZVOL but is it possible to tune it for a regular file? If you have a 4kB
> random write workload then 128kB blocks will be a killer.
>
> I'm not sure if there is a way to tell what the chosen blocksize is for an
> individual file. Is that reported on a per-file basis by stat() on the file
> in st_blksize?
>
> Cheers, Andreas
>
> On 2013-10-31, at 11:30, Doug Dumitru <dougdumit...-***@public.gmane.org<javascript:>>
> wrote:
>
> On Thursday, October 31, 2013 3:44:34 AM UTC-7, Niels de Carpentier wrote:
>>
>> >
>> >
>> > On Tuesday, October 29, 2013 2:36:21 PM UTC-7, db48x wrote:
>> >>
>> >> What do you mean by 'raid-0' when you're talking about ZFS? Do you
>> >> really
>> >> mean 24 top-level vdevs each consisting of 1 drive?
>> >>
>> >> Regardless, write amplification is to be expected; remember that each
>> >> block in your vdev is a leaf in a Merkle tree, and that each change to
>> a
>> >> block requires writing new versions of all of the ancestor blocks as
>> >> well.
>> >>
>> >> The setup is just create a zpool with 24 backing drives, ie:
>> >
>> > zpool create data /dev/sd[b-y] -f
>>
>> I'm missing -o shift=12 here. Did you use that when testing?
>>
>
> I have tried this with ashift=9 and ashift=12. Modern SSDs tend to be 4K
> (and sometimes 8K internally), but the ZFS write streams are not as bad as
> a true random workload.
>
> ashift=9 seems to amplify a little less than ashift=12, but raid-z
> performance drops off a lot more. 4K random writes into a zvol are just
> under 80K IOPS but drop below 30K IOPS with raid-z1 and ashift=512. The
> wear amp goes up, but there is more "state" going on here.
>
> My tests are very dependent on the "state of the volume". Sometimes I get
> "hot drives" where a few drives are getting a lot more writes than the rest
> of the group. This only happens with raid and is most common with
> raid-z3. The scary thing about this is that with flash, it is the last
> thing you want to see.
>
> Again, back to "state", the order of tests impacts wear. I suspect this
> is not drive conditioning issues as the drives are never very busy. With
> 24 drives, I have never seen the drive %util above 30% except on pure
> random reads. I have seen wear amp as high as 23:1 (which is really bad)
> and as low as 1.7:1 for 4K random writes. 64K random writes are much
> better (which would be expected).
>
> My intent here is perhaps not 100% pure. My company has commercial
> software specifically designed for SSDs. No where near the feature set of
> ZFS (not even trying), but some of the issues like data integrity are
> addressed (although with a completely different method). We focus on wear
> and 4K random writes and can reach 1M IOPS with de-dupe turned on with this
> array. But my intention is not to produce an advert. We just see lots of
> ZFS "talk" and wanted to see how well it maps to SSDs. It looks like SSDs
> were not the design target and this has led to some trade-offs. This is
> not bad, just how it worked out. I am somewhat surprised that in all my
> "google searching" I have never seen any mention of wear amplification in
> discussions. It seems that the design implies that it has to exist, but
> that no-one have ever measured it. The trend in flash is to go to lower
> and lower geometries which in turn leads to lower and lower endurance.
> Thus the concern with wear. Everything in ZFS seems to point to flash on
> the L2ARC and ZIL device but with HDDs still running the main store. If
> ZFS needs high endurance media because of wear amplification, this hybrid
> approach may remain the norm.
>
> Again, too much advert. If you want to discuss what I am doing more,
> please contact me off-group. If you have any suggestions for my testing
> with ZFS, discussions here are appropriate.
>
> Doug Dumitru
> EasyCo LLC
>
>>
>> Niels
>>
>>
>>
>> To unsubscribe from this group and stop receiving emails from it, send
> an email to zfs-discuss...-VKpPRiiRko4/***@public.gmane.org <javascript:>.
>
>

Doug Dumitru
2013-11-01 18:03:29 UTC
Permalink
s***@public.gmane.org
2013-11-01 18:29:45 UTC
Permalink
These numbers are really bad.

Like you said, 1.3:1 wear amplification should be acceptable (for hashes
etc.), but 8.43:1 or 10.34:1 with volblocksize=4K is really bad.

Can you please test it with a dataset and use recordsize=4k?

Can you also test with only 3 SSDs in raidz1? With raidz, the IOPS of the
pool is limited by its slowest vdev (in your case a single SSD), so even
with 24 drives in one raidz vdev your IOPS will only be about that of one
drive.
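
Something along these lines, reusing the device names from your script (just a sketch):

zpool create -f -o ashift=12 test raidz1 /dev/sdb /dev/sdc /dev/sdd
zfs create -o recordsize=4K -o compression=off test/fs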

On Friday, November 1, 2013 7:03:29 PM UTC+1, Doug Dumitru wrote:
>
> I suspect the wear is different on SSDs than on HDDs because of scheduling
> issues. SSDs tend to turn around IO requests faster than timer ticks so it
> is easy to "chatter" a scheduler algorithm.
>
> In my test case, the array is quite large. I know that "mirroring" is
> faster than raid-z1, but the goal with SSDs is to keep the costs as
> reasonable as possible. In my case, I try to write code for SSDs that:
>
> * Keep the wear amplification as low as possible. 1.3:1 is good for
> incompressible, undedupable data. Lower if data reduction can work.
> * Keep the redundancy overhead as low as possible. 22+2 (raid-5 with a
> hot-spare) for a 24 drive array is typical. Write perfect raid stripes.
> Never let a read/modify/write raid operation happen.
> * Run with consumer media if possible. With good wear this allows for 1
> array overwrite per day with 5 year life.
>
> My ZFS tests are "all over the map". While I understand that mirrors are
> faster, they should not be lower wear. You are after all writing to both
> drives.
>
> Here is my latest, simple test:
>
> #!/bin/bash -x
>
> ./stop-zfs.sh
>
> modprobe zfs zfs_arc_min=$((1024*1024*1024))
> zfs_arc_max=$((8192*1024*1024))
>
> zpool create data raidz1 /dev/sd[b-y] -f -o ashift=12
> zfs create -V 300G data/test -o volblocksize=4K
> zfs set logbias=throughput data/test
>
> ./bm-flash --size=$((256*1024)) -wc --max=16384 /dev/data/test
>
> ( sleep 5 ; iostat -mx 30 2 ) &
> ./bm-flash --size=$((256*1024)) -w --min=4096 --max=4096 --th1=10 --th2=0
> --th3=0 --tm=40 /dev/data/test
>
> The output from the 30 second iostat snapshot is:
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 0.07 0.27 0.00 0.00
> 7.20 0.00 0.40 0.00 0.50 0.40 0.01
> dm-0 0.00 0.00 0.00 0.27 0.00 0.00
> 6.00 0.00 0.50 0.00 0.50 0.50 0.01
> dm-1 0.00 0.00 0.00 0.20 0.00 0.00
> 8.00 0.00 0.67 0.00 0.67 0.67 0.01
> dm-2 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> dm-3 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> dm-4 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdc 0.00 0.00 706.97 3039.67 2.76 16.39
> 10.47 0.79 0.21 0.18 0.22 0.09 33.73
> sdb 0.00 0.00 2046.90 4083.10 8.00 23.31
> 10.46 1.30 0.21 0.19 0.22 0.08 50.59
> sdd 0.00 0.00 2089.07 4086.60 8.16 23.31
> 10.44 1.32 0.21 0.19 0.22 0.08 51.47
> sde 0.00 0.00 675.73 1412.77 2.64 7.07
> 9.52 0.43 0.20 0.18 0.22 0.11 23.72
> sdf 0.00 0.00 2065.53 4087.60 8.07 23.32
> 10.45 1.31 0.21 0.19 0.22 0.08 51.49
> sdh 0.00 0.00 2094.37 4062.10 8.18 23.31
> 10.48 1.30 0.21 0.19 0.22 0.08 51.15
> sdg 0.00 0.00 710.70 3042.20 2.78 16.39
> 10.46 0.79 0.21 0.18 0.22 0.09 33.40
> sdi 0.00 0.00 669.63 1409.23 2.62 7.07
> 9.54 0.42 0.20 0.18 0.22 0.11 23.28
> sdp 0.00 0.00 2096.43 3748.83 8.19 23.32
> 11.04 1.50 0.26 0.35 0.20 0.09 55.20
> sdq 0.00 0.00 670.33 1405.67 2.62 7.07
> 9.56 0.42 0.20 0.19 0.21 0.11 22.35
> sdj 0.00 0.00 2053.67 3771.50 8.02 23.32
> 11.02 1.46 0.25 0.34 0.20 0.09 54.75
> sdk 0.00 0.00 713.17 3012.00 2.79 16.39
> 10.54 0.82 0.22 0.27 0.21 0.09 34.25
> sdl 0.00 0.00 2094.93 3764.97 8.18 23.32
> 11.01 1.47 0.25 0.33 0.21 0.09 54.12
> sdm 0.00 0.00 670.00 1405.60 2.62 7.07
> 9.56 0.41 0.20 0.17 0.21 0.11 22.36
> sdn 0.00 0.00 2048.33 3793.67 8.00 23.32
> 10.98 1.45 0.25 0.34 0.20 0.09 54.21
> sdo 0.00 0.00 713.07 3024.23 2.79 16.39
> 10.51 0.82 0.22 0.28 0.21 0.09 34.35
> sdr 0.00 0.00 2044.40 4084.47 7.99 23.32
> 10.46 1.30 0.21 0.19 0.22 0.08 50.89
> sds 0.00 0.00 702.43 3035.97 2.74 16.39
> 10.48 0.80 0.21 0.18 0.22 0.09 34.35
> sdu 0.00 0.00 675.80 1410.17 2.64 7.07
> 9.53 0.42 0.20 0.18 0.22 0.11 23.76
> sdt 0.00 0.00 2074.70 4071.50 8.10 23.31
> 10.47 1.33 0.22 0.19 0.23 0.08 51.84
> sdy 0.00 0.00 669.27 1410.20 2.61 7.07
> 9.54 0.42 0.20 0.18 0.21 0.11 23.03
> sdx 0.00 0.00 2085.13 4086.00 8.15 23.31
> 10.44 1.33 0.22 0.19 0.23 0.08 52.08
> sdw 0.00 0.00 704.43 3056.30 2.75 16.39
> 10.42 0.79 0.21 0.18 0.22 0.09 34.13
> sdv 0.00 0.00 2054.43 4087.33 8.03 23.31
> 10.45 1.33 0.22 0.20 0.23 0.08 52.15
> zd0 0.00 0.00 0.00 11713.10 0.00 45.75
> 8.00 9.90 0.85 0.00 0.85 0.09 100.00
>
> The benchmark program is a C program that opens a block device with
> O_DIRECT. All IO is aligned both on block and memory boundaries. The
> "preconditioning" fills 256GB of the test volume with random data and then
> does some writes at 512byte to 16Kbyte blocks at q=1 to q=40. Just a
> repeatable set of IOs to get the device started. It should be noted that
> the actual ZPOOL is 3TB so this 256GB is nowhere near full from a total
> device viewpoint.
>
> sda and the dm-??? devices are the boot devices and their logical
> volumes. sdb - sdy are the 24 SSDs used in the test. The 4K random write
> test itself ran at 11,713 IOPS which is about 46MB/sec. You can see this
> on the zd0 line from iostat. The zd0 device was 100% busy.
>
> The 24 SSDs were not 100% busy. Also, they did not see even amounts of
> IO. If you add up the write MB/sec columns, you get 421 MB/sec for a wear
> amplification of 9.19:1. Even worse, if you take the busiest drive, the
> wear amplification is 12.23:1
>
> I re-ran this creating the zvol with volblocksize=4K and the results are
> in some ways worse.
>
> First, the linear fill speed filling the device went from 815MB/sec down
> to 404 MB/sec. Here is the iostat capture 4K random write part of this
> test:
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 0.00 0.17 0.00 0.00
> 4.80 0.00 0.00 0.00 0.00 0.00 0.00
> dm-0 0.00 0.00 0.00 0.17 0.00 0.00
> 4.80 0.00 0.00 0.00 0.00 0.00 0.00
> dm-1 0.00 0.00 0.00 0.10 0.00 0.00
> 8.00 0.00 0.00 0.00 0.00 0.00 0.00
> dm-2 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> dm-3 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> dm-4 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdc 0.00 0.00 319.03 1765.80 1.25 9.62
> 10.67 0.44 0.21 0.19 0.21 0.10 19.89
> sdb 0.00 0.00 932.57 2888.70 3.64 15.83
> 10.43 0.82 0.21 0.19 0.22 0.08 29.57
> sdd 0.00 0.00 937.03 2893.70 3.66 15.82
> 10.42 0.81 0.21 0.19 0.22 0.08 29.77
> sde 0.00 0.00 305.40 1989.73 1.19 10.47
> 10.41 0.48 0.21 0.19 0.21 0.09 19.80
> sdf 0.00 0.00 919.13 2894.97 3.59 15.82
> 10.42 0.81 0.21 0.19 0.22 0.08 29.25
> sdh 0.00 0.00 928.33 2893.90 3.63 15.83
> 10.42 0.82 0.21 0.19 0.22 0.08 29.51
> sdg 0.00 0.00 308.97 1773.57 1.21 9.60
> 10.62 0.43 0.21 0.18 0.21 0.09 19.31
> sdi 0.00 0.00 311.17 1990.53 1.22 10.44
> 10.37 0.49 0.21 0.19 0.22 0.09 20.29
> sdp 0.00 0.00 932.40 2791.30 3.64 15.83
> 10.71 0.88 0.24 0.36 0.20 0.08 31.45
> sdq 0.00 0.00 303.73 2016.53 1.19 10.45
> 10.27 0.49 0.21 0.26 0.20 0.09 20.23
> sdj 0.00 0.00 934.50 2814.23 3.65 15.82
> 10.64 0.86 0.23 0.33 0.20 0.08 29.99
> sdk 0.00 0.00 316.53 1792.63 1.24 9.59
> 10.51 0.43 0.20 0.26 0.19 0.09 19.83
> sdl 0.00 0.00 935.73 2861.43 3.66 15.81
> 10.50 0.83 0.22 0.30 0.19 0.08 29.92
> sdm 0.00 0.00 306.20 2014.37 1.20 10.41
> 10.25 0.49 0.21 0.27 0.20 0.09 20.28
> sdn 0.00 0.00 929.53 2825.83 3.63 15.82
> 10.61 0.87 0.23 0.35 0.19 0.08 31.47
> sdo 0.00 0.00 310.53 1798.07 1.21 9.61
> 10.51 0.42 0.20 0.24 0.20 0.09 19.85
> sdr 0.00 0.00 918.30 2887.80 3.59 15.82
> 10.44 0.81 0.21 0.20 0.22 0.08 29.68
> sds 0.00 0.00 314.67 1773.30 1.23 9.61
> 10.63 0.44 0.21 0.20 0.21 0.09 19.71
> sdu 0.00 0.00 307.03 1992.40 1.20 10.48
> 10.40 0.50 0.22 0.18 0.22 0.09 20.44
> sdt 0.00 0.00 930.23 2894.17 3.63 15.85
> 10.43 0.83 0.22 0.20 0.22 0.08 30.05
> sdy 0.00 0.00 310.40 1994.33 1.21 10.48
> 10.39 0.49 0.21 0.18 0.22 0.09 19.88
> sdx 0.00 0.00 936.17 2887.07 3.66 15.85
> 10.45 0.83 0.22 0.21 0.22 0.08 30.64
> sdw 0.00 0.00 314.67 1778.27 1.23 9.63
> 10.63 0.44 0.21 0.21 0.21 0.09 19.77
> sdv 0.00 0.00 928.20 2899.53 3.63 15.86
> 10.43 0.83 0.22 0.20 0.22 0.08 29.89
> zd0 0.00 0.00 2.77 9427.73 0.01 36.83
> 8.00 6.90 0.73 2.31 0.73 0.07 70.43
>
> If you add up the numbers, the wear amplification is better at 8.43:1 or
> 10.34:1 for the hottest drive, but this is still quite bad. Even worse,
> the write IOPS is now down to 9427 and 37 MB/sec. Remember that this array
> can write linearly at around 6000 MB/sec.
>
> I suspect my results are skewed by zvols. Then again, they probably match
> in-place updates inside of a single file. So if you are running a database
> you will probably see these numbers.
>
> ZFS has some amazing features. It just looks like its design decisions
> were not made with SSDs in mind.
>
> I have posted the 'bm-flash' binary to:
>
>
> https://drive.google.com/file/d/0B3T4AZzjEGVkemkzWlhubmNmb0E/edit?usp=sharing
>
> I will post the source in a couple of days (if anyone is interested).
>
> Doug Dumitru
> EasyCo LLC
>
>

aurfalien
2013-11-01 18:36:35 UTC
Permalink
Kind of curious.

Did you really mean a single 22 disk vdev at RaidZ?

Based on my tests so far, the more vdevs you stripe, the more performance
you get.

I may be off base for your particular scenario, but have you tried redoing your pool with multiple vdevs? My sweet spot was many striped 3-disk raidz1 vdevs, which did better than fewer 6-disk raidz2 vdevs.
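
On 24 disks that would look something like this (a sketch in the same style as your create commands; eight striped 3-disk raidz1 vdevs):

zpool create -f -o ashift=12 data \
    raidz1 /dev/sd[b-d] raidz1 /dev/sd[e-g] raidz1 /dev/sd[h-j] \
    raidz1 /dev/sd[k-m] raidz1 /dev/sd[n-p] raidz1 /dev/sd[q-s] \
    raidz1 /dev/sd[t-v] raidz1 /dev/sd[w-y]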

- aurf

On Nov 1, 2013, at 11:03 AM, Doug Dumitru wrote:

> I suspect the wear is different on SSDs than on HDDs because of scheduling issues. SSDs tend to turn around IO requests faster than timer ticks so it is easy to "chatter" a scheduler algorithm.
>
> In my test case, there array is quite large. I know that "mirroring" is faster than raid-z1, but the goal with SSDs is to keep the costs as reasonable as possible. In my case, I try to write code for SSDs that:
>
> * Keep the wear amplification as low as possible. 1.3:1 is good for incompressible, undedupable data. Lower if data reduction can work.
> * Keep the redundancy overhead as low as possible. 22+2 (raid-5 with a hot-spare) for a 24 drive array is typical. Write perfect raid stripes. Never let a read/modify/write raid operation happen.
> * Run with consumer media if possible. With good wear this allows for 1 array overwrite per day with 5 year life.
>
> My ZFS tests are "all over the map". While I understand that mirrors are faster, they should not be lower wear. You are after all writing to both drives.
>
> Here is my latest, simple test:
>
> #!/bin/bash -x
>
> ./stop-zfs.sh
>
> modprobe zfs zfs_arc_min=$((1024*1024*1024)) zfs_arc_max=$((8192*1024*1024))
>
> zpool create data raidz1 /dev/sd[b-y] -f -o ashift=12
> zfs create -V 300G data/test -o volblocksize=4K
> zfs set logbias=throughput data/test
>
> ./bm-flash --size=$((256*1024)) -wc --max=16384 /dev/data/test
>
> ( sleep 5 ; iostat -mx 30 2 ) &
> ./bm-flash --size=$((256*1024)) -w --min=4096 --max=4096 --th1=10 --th2=0 --th3=0 --tm=40 /dev/data/test
>
> The output from the 30 second iostat snapshot is:
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 0.07 0.27 0.00 0.00 7.20 0.00 0.40 0.00 0.50 0.40 0.01
> dm-0 0.00 0.00 0.00 0.27 0.00 0.00 6.00 0.00 0.50 0.00 0.50 0.50 0.01
> dm-1 0.00 0.00 0.00 0.20 0.00 0.00 8.00 0.00 0.67 0.00 0.67 0.67 0.01
> dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdc 0.00 0.00 706.97 3039.67 2.76 16.39 10.47 0.79 0.21 0.18 0.22 0.09 33.73
> sdb 0.00 0.00 2046.90 4083.10 8.00 23.31 10.46 1.30 0.21 0.19 0.22 0.08 50.59
> sdd 0.00 0.00 2089.07 4086.60 8.16 23.31 10.44 1.32 0.21 0.19 0.22 0.08 51.47
> sde 0.00 0.00 675.73 1412.77 2.64 7.07 9.52 0.43 0.20 0.18 0.22 0.11 23.72
> sdf 0.00 0.00 2065.53 4087.60 8.07 23.32 10.45 1.31 0.21 0.19 0.22 0.08 51.49
> sdh 0.00 0.00 2094.37 4062.10 8.18 23.31 10.48 1.30 0.21 0.19 0.22 0.08 51.15
> sdg 0.00 0.00 710.70 3042.20 2.78 16.39 10.46 0.79 0.21 0.18 0.22 0.09 33.40
> sdi 0.00 0.00 669.63 1409.23 2.62 7.07 9.54 0.42 0.20 0.18 0.22 0.11 23.28
> sdp 0.00 0.00 2096.43 3748.83 8.19 23.32 11.04 1.50 0.26 0.35 0.20 0.09 55.20
> sdq 0.00 0.00 670.33 1405.67 2.62 7.07 9.56 0.42 0.20 0.19 0.21 0.11 22.35
> sdj 0.00 0.00 2053.67 3771.50 8.02 23.32 11.02 1.46 0.25 0.34 0.20 0.09 54.75
> sdk 0.00 0.00 713.17 3012.00 2.79 16.39 10.54 0.82 0.22 0.27 0.21 0.09 34.25
> sdl 0.00 0.00 2094.93 3764.97 8.18 23.32 11.01 1.47 0.25 0.33 0.21 0.09 54.12
> sdm 0.00 0.00 670.00 1405.60 2.62 7.07 9.56 0.41 0.20 0.17 0.21 0.11 22.36
> sdn 0.00 0.00 2048.33 3793.67 8.00 23.32 10.98 1.45 0.25 0.34 0.20 0.09 54.21
> sdo 0.00 0.00 713.07 3024.23 2.79 16.39 10.51 0.82 0.22 0.28 0.21 0.09 34.35
> sdr 0.00 0.00 2044.40 4084.47 7.99 23.32 10.46 1.30 0.21 0.19 0.22 0.08 50.89
> sds 0.00 0.00 702.43 3035.97 2.74 16.39 10.48 0.80 0.21 0.18 0.22 0.09 34.35
> sdu 0.00 0.00 675.80 1410.17 2.64 7.07 9.53 0.42 0.20 0.18 0.22 0.11 23.76
> sdt 0.00 0.00 2074.70 4071.50 8.10 23.31 10.47 1.33 0.22 0.19 0.23 0.08 51.84
> sdy 0.00 0.00 669.27 1410.20 2.61 7.07 9.54 0.42 0.20 0.18 0.21 0.11 23.03
> sdx 0.00 0.00 2085.13 4086.00 8.15 23.31 10.44 1.33 0.22 0.19 0.23 0.08 52.08
> sdw 0.00 0.00 704.43 3056.30 2.75 16.39 10.42 0.79 0.21 0.18 0.22 0.09 34.13
> sdv 0.00 0.00 2054.43 4087.33 8.03 23.31 10.45 1.33 0.22 0.20 0.23 0.08 52.15
> zd0 0.00 0.00 0.00 11713.10 0.00 45.75 8.00 9.90 0.85 0.00 0.85 0.09 100.00
>
> The benchmark program is a C program that opens a block device with O_DIRECT. All IO is aligned both on block and memory boundaries. The "preconditioning" fills 256GB of the test volume with random data and then does some writes at 512byte to 16Kbyte blocks at q=1 to q=40. Just a repeatable set of IOs to get the device started. It should be noted that the actual ZPOOL is 3TB so this 256GB is no where near full from a total device viewpoint.
>
> sda and the dm-??? devices are the boot devices and their logical volumes. sdb - sdy are the 24 SSDs used in the test. The 4K random write test itself ran at 11,713 IOPS which is about 46MB/sec. You can see this on the zd0 line from iostat. The zd0 device was 100% busy.
>
> The 24 SSDs were not 100% busy. Also, they did not see even amounts of IO. If you add up with write MB/sec columns, you get 421 MB/sec for a wear amplification of 9.19:1. Even worse, if you take the busiest drive, the wear amplification is 12.23:1
>
> I re-ran this creating the zvol with volblocksize=4K and the results are in some ways worse.
>
> First, the linear fill speed filling the device went from 815MB/sec down to 404 MB/sec. Here is the iostat capture 4K random write part of this test:
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 0.00 0.17 0.00 0.00 4.80 0.00 0.00 0.00 0.00 0.00 0.00
> dm-0 0.00 0.00 0.00 0.17 0.00 0.00 4.80 0.00 0.00 0.00 0.00 0.00 0.00
> dm-1 0.00 0.00 0.00 0.10 0.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
> dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdc 0.00 0.00 319.03 1765.80 1.25 9.62 10.67 0.44 0.21 0.19 0.21 0.10 19.89
> sdb 0.00 0.00 932.57 2888.70 3.64 15.83 10.43 0.82 0.21 0.19 0.22 0.08 29.57
> sdd 0.00 0.00 937.03 2893.70 3.66 15.82 10.42 0.81 0.21 0.19 0.22 0.08 29.77
> sde 0.00 0.00 305.40 1989.73 1.19 10.47 10.41 0.48 0.21 0.19 0.21 0.09 19.80
> sdf 0.00 0.00 919.13 2894.97 3.59 15.82 10.42 0.81 0.21 0.19 0.22 0.08 29.25
> sdh 0.00 0.00 928.33 2893.90 3.63 15.83 10.42 0.82 0.21 0.19 0.22 0.08 29.51
> sdg 0.00 0.00 308.97 1773.57 1.21 9.60 10.62 0.43 0.21 0.18 0.21 0.09 19.31
> sdi 0.00 0.00 311.17 1990.53 1.22 10.44 10.37 0.49 0.21 0.19 0.22 0.09 20.29
> sdp 0.00 0.00 932.40 2791.30 3.64 15.83 10.71 0.88 0.24 0.36 0.20 0.08 31.45
> sdq 0.00 0.00 303.73 2016.53 1.19 10.45 10.27 0.49 0.21 0.26 0.20 0.09 20.23
> sdj 0.00 0.00 934.50 2814.23 3.65 15.82 10.64 0.86 0.23 0.33 0.20 0.08 29.99
> sdk 0.00 0.00 316.53 1792.63 1.24 9.59 10.51 0.43 0.20 0.26 0.19 0.09 19.83
> sdl 0.00 0.00 935.73 2861.43 3.66 15.81 10.50 0.83 0.22 0.30 0.19 0.08 29.92
> sdm 0.00 0.00 306.20 2014.37 1.20 10.41 10.25 0.49 0.21 0.27 0.20 0.09 20.28
> sdn 0.00 0.00 929.53 2825.83 3.63 15.82 10.61 0.87 0.23 0.35 0.19 0.08 31.47
> sdo 0.00 0.00 310.53 1798.07 1.21 9.61 10.51 0.42 0.20 0.24 0.20 0.09 19.85
> sdr 0.00 0.00 918.30 2887.80 3.59 15.82 10.44 0.81 0.21 0.20 0.22 0.08 29.68
> sds 0.00 0.00 314.67 1773.30 1.23 9.61 10.63 0.44 0.21 0.20 0.21 0.09 19.71
> sdu 0.00 0.00 307.03 1992.40 1.20 10.48 10.40 0.50 0.22 0.18 0.22 0.09 20.44
> sdt 0.00 0.00 930.23 2894.17 3.63 15.85 10.43 0.83 0.22 0.20 0.22 0.08 30.05
> sdy 0.00 0.00 310.40 1994.33 1.21 10.48 10.39 0.49 0.21 0.18 0.22 0.09 19.88
> sdx 0.00 0.00 936.17 2887.07 3.66 15.85 10.45 0.83 0.22 0.21 0.22 0.08 30.64
> sdw 0.00 0.00 314.67 1778.27 1.23 9.63 10.63 0.44 0.21 0.21 0.21 0.09 19.77
> sdv 0.00 0.00 928.20 2899.53 3.63 15.86 10.43 0.83 0.22 0.20 0.22 0.08 29.89
> zd0 0.00 0.00 2.77 9427.73 0.01 36.83 8.00 6.90 0.73 2.31 0.73 0.07 70.43
>
> If you add up the numbers, the wear amplification is better at 8.43:1 or 10.34:1 for the hottest drive, but this is still quite bad. Even worse, the write IOPS is now down to 9427 and 37 MB/sec. Remember that this array can write linearly at around 6000 MB/sec.
>
> I suspect my results are skewed by zvols. Then again, they probably match in-place updates inside of a single file. So if you are running a database you will probably see these numbers.
>
> ZFS has some amazing features. It just looks like it's design decisions were not made with SSDs in mind.
>
> I have posted the 'bm-flash' binary to:
>
> https://drive.google.com/file/d/0B3T4AZzjEGVkemkzWlhubmNmb0E/edit?usp=sharing
>
> I will post the source in a couple of days (if anyone is interested).
>
> Doug Dumitru
> EasyCo LLC
>
>
> To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org

Chris Siebenmann
2013-11-01 18:44:47 UTC
Permalink
| My ZFS tests are "all over the map". While I understand that mirrors are
| faster, they should not be lower wear. You are after all writing to both
| drives.

My understanding is that 4K random writes to a ZFS raidzN with
ashift=12 are in many ways a worst case for write amplification. The
design decision/problem is that ZFS does RAID parity over variable-sized
blocks of however much you write (up to the volume block size or the
record size). If you write 4K on an ashift=12 vdev, ZFS must write a
4K data block and then N 4K parity blocks; effectively you're doing
mirroring (or worse) without realizing it.
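
A back-of-the-envelope sketch of just the data-plus-parity part (ignoring the metadata and tree updates, which come on top):

# bytes hitting the disks for one 4K logical write at ashift=12
for nparity in 1 2 3; do
    echo "raidz$nparity: $(( (1 + nparity) * 4 ))K written for a 4K write"
done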

Because ZFS metadata (and data) is fundamentally a Merkle tree and your
individual metadata elements are often small, I believe this effect can
ripple up the tree: e.g. if you rewrite a 4K directory block because it now
has to point to the new file extent that points to the new data block, that
directory block is also write-amplified N times. Fortunately you only need
to write the new Merkle tree at transaction group (TXG) boundaries, but I
believe those are every five seconds by default.
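
If it helps, ZFS on Linux exposes that interval as a module parameter (a sketch; the value is in seconds, and raising it trades a larger loss window for fewer tree rewrites):

cat /sys/module/zfs/parameters/zfs_txg_timeout
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout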

In this situation you'll also see unbalanced IO depending on which
disks ZFS chooses to put the 4K data block and the N blocks of parity
on. I don't know how ZFS chooses which disks to use here and how it
levels them out over time.

- cks

Niels de Carpentier
2013-11-01 10:38:12 UTC
Permalink
>
> My intent here is perhaps not 100% pure. My company has commercial
> software specifically designed for SSDs. No where near the feature set of
> ZFS (not even trying), but some of the issues like data integrity are
> addressed (although with a completely different method). We focus on wear
> and 4K random writes and can reach 1M IOPS with de-dupe turned on with
> this
> array. But my intention is not to produce an advert. We just see lots of
> ZFS "talk" and wanted to see how well it maps to SSDs. It looks like SSDs
> were not the design target and this has led to some trade-offs. This is
> not bad, just how it worked out. I am somewhat surprised that in all my
> "google searching" I have never seen any mention of wear amplification in
> discussions. It seems that the design implies that it has to exist, but
> that no-one have ever measured it. The trend in flash is to go to lower
> and lower geometries which in turn leads to lower and lower endurance.
> Thus the concern with wear. Everything in ZFS seems to point to flash on
> the L2ARC and ZIL device but with HDDs still running the main store. If
> ZFS needs high endurance media because of wear amplification, this hybrid
> approach may remain the norm.

Likely this behaviour is something specific to your setup/testing
methodology (and likely the wrong zvol blocksize). Write amplification on
spinning disks would be just as bad if it causes extra seeks, and the
performance impact would be much larger in that case. Your raidz3 setup is
against all recommendations and is documented to cause performance issues.
A stripe of mirrors is the normal choice for such a case, or multiple
smaller raidz1/2/3 vdevs. I would think ZFS would actually be well suited
to SSDs, since it does a sort of wear leveling by itself. I'll try to do
some testing when I have time and see if I can reproduce your results.

Niels

Gordan Bobic
2013-11-01 18:22:11 UTC
Permalink
Err... 23-disk RAID5 is way beyond silly, both in terms of performance and reliability.


Doug Dumitru <dougdumitruredirect-***@public.gmane.org> wrote:

>I suspect the wear is different on SSDs than on HDDs because of scheduling
>issues. SSDs tend to turn around IO requests faster than timer ticks so it
>is easy to "chatter" a scheduler algorithm.
>
>In my test case, there array is quite large. I know that "mirroring" is
>faster than raid-z1, but the goal with SSDs is to keep the costs as
>reasonable as possible. In my case, I try to write code for SSDs that:
>
>* Keep the wear amplification as low as possible. 1.3:1 is good for
>incompressible, undedupable data. Lower if data reduction can work.
>* Keep the redundancy overhead as low as possible. 22+2 (raid-5 with a
>hot-spare) for a 24 drive array is typical. Write perfect raid stripes.
>Never let a read/modify/write raid operation happen.
>* Run with consumer media if possible. With good wear this allows for 1
>array overwrite per day with 5 year life.
>
>My ZFS tests are "all over the map". While I understand that mirrors are
>faster, they should not be lower wear. You are after all writing to both
>drives.
>
>Here is my latest, simple test:
>
>#!/bin/bash -x
>
>./stop-zfs.sh
>
>modprobe zfs zfs_arc_min=$((1024*1024*1024)) zfs_arc_max=$((8192*1024*1024))
>
>zpool create data raidz1 /dev/sd[b-y] -f -o ashift=12
>zfs create -V 300G data/test -o volblocksize=4K
>zfs set logbias=throughput data/test
>
>./bm-flash --size=$((256*1024)) -wc --max=16384 /dev/data/test
>
>( sleep 5 ; iostat -mx 30 2 ) &
>./bm-flash --size=$((256*1024)) -w --min=4096 --max=4096 --th1=10 --th2=0
>--th3=0 --tm=40 /dev/data/test
>
>The output from the 30 second iostat snapshot is:
>
>Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz
>avgqu-sz await r_await w_await svctm %util
>sda 0.00 0.00 0.07 0.27 0.00 0.00
>7.20 0.00 0.40 0.00 0.50 0.40 0.01
>dm-0 0.00 0.00 0.00 0.27 0.00 0.00
>6.00 0.00 0.50 0.00 0.50 0.50 0.01
>dm-1 0.00 0.00 0.00 0.20 0.00 0.00
>8.00 0.00 0.67 0.00 0.67 0.67 0.01
>dm-2 0.00 0.00 0.00 0.00 0.00 0.00
>0.00 0.00 0.00 0.00 0.00 0.00 0.00
>dm-3 0.00 0.00 0.00 0.00 0.00 0.00
>0.00 0.00 0.00 0.00 0.00 0.00 0.00
>dm-4 0.00 0.00 0.00 0.00 0.00 0.00
>0.00 0.00 0.00 0.00 0.00 0.00 0.00
>sdc 0.00 0.00 706.97 3039.67 2.76 16.39
>10.47 0.79 0.21 0.18 0.22 0.09 33.73
>sdb 0.00 0.00 2046.90 4083.10 8.00 23.31
>10.46 1.30 0.21 0.19 0.22 0.08 50.59
>sdd 0.00 0.00 2089.07 4086.60 8.16 23.31
>10.44 1.32 0.21 0.19 0.22 0.08 51.47
>sde 0.00 0.00 675.73 1412.77 2.64 7.07
>9.52 0.43 0.20 0.18 0.22 0.11 23.72
>sdf 0.00 0.00 2065.53 4087.60 8.07 23.32
>10.45 1.31 0.21 0.19 0.22 0.08 51.49
>sdh 0.00 0.00 2094.37 4062.10 8.18 23.31
>10.48 1.30 0.21 0.19 0.22 0.08 51.15
>sdg 0.00 0.00 710.70 3042.20 2.78 16.39
>10.46 0.79 0.21 0.18 0.22 0.09 33.40
>sdi 0.00 0.00 669.63 1409.23 2.62 7.07
>9.54 0.42 0.20 0.18 0.22 0.11 23.28
>sdp 0.00 0.00 2096.43 3748.83 8.19 23.32
>11.04 1.50 0.26 0.35 0.20 0.09 55.20
>sdq 0.00 0.00 670.33 1405.67 2.62 7.07
>9.56 0.42 0.20 0.19 0.21 0.11 22.35
>sdj 0.00 0.00 2053.67 3771.50 8.02 23.32
>11.02 1.46 0.25 0.34 0.20 0.09 54.75
>sdk 0.00 0.00 713.17 3012.00 2.79 16.39
>10.54 0.82 0.22 0.27 0.21 0.09 34.25
>sdl 0.00 0.00 2094.93 3764.97 8.18 23.32
>11.01 1.47 0.25 0.33 0.21 0.09 54.12
>sdm 0.00 0.00 670.00 1405.60 2.62 7.07
>9.56 0.41 0.20 0.17 0.21 0.11 22.36
>sdn 0.00 0.00 2048.33 3793.67 8.00 23.32
>10.98 1.45 0.25 0.34 0.20 0.09 54.21
>sdo 0.00 0.00 713.07 3024.23 2.79 16.39
>10.51 0.82 0.22 0.28 0.21 0.09 34.35
>sdr 0.00 0.00 2044.40 4084.47 7.99 23.32
>10.46 1.30 0.21 0.19 0.22 0.08 50.89
>sds 0.00 0.00 702.43 3035.97 2.74 16.39
>10.48 0.80 0.21 0.18 0.22 0.09 34.35
>sdu 0.00 0.00 675.80 1410.17 2.64 7.07
>9.53 0.42 0.20 0.18 0.22 0.11 23.76
>sdt 0.00 0.00 2074.70 4071.50 8.10 23.31
>10.47 1.33 0.22 0.19 0.23 0.08 51.84
>sdy 0.00 0.00 669.27 1410.20 2.61 7.07
>9.54 0.42 0.20 0.18 0.21 0.11 23.03
>sdx 0.00 0.00 2085.13 4086.00 8.15 23.31
>10.44 1.33 0.22 0.19 0.23 0.08 52.08
>sdw 0.00 0.00 704.43 3056.30 2.75 16.39
>10.42 0.79 0.21 0.18 0.22 0.09 34.13
>sdv 0.00 0.00 2054.43 4087.33 8.03 23.31
>10.45 1.33 0.22 0.20 0.23 0.08 52.15
>zd0 0.00 0.00 0.00 11713.10 0.00 45.75
>8.00 9.90 0.85 0.00 0.85 0.09 100.00
>
>The benchmark program is a C program that opens a block device with
>O_DIRECT. All IO is aligned both on block and memory boundaries. The
>"preconditioning" fills 256GB of the test volume with random data and then
>does some writes at 512byte to 16Kbyte blocks at q=1 to q=40. Just a
>repeatable set of IOs to get the device started. It should be noted that
>the actual ZPOOL is 3TB so this 256GB is no where near full from a total
>device viewpoint.
>
>sda and the dm-??? devices are the boot devices and their logical volumes.
>sdb - sdy are the 24 SSDs used in the test. The 4K random write test
>itself ran at 11,713 IOPS which is about 46MB/sec. You can see this on the
>zd0 line from iostat. The zd0 device was 100% busy.
>
>The 24 SSDs were not 100% busy. Also, they did not see even amounts of
>IO. If you add up with write MB/sec columns, you get 421 MB/sec for a wear
>amplification of 9.19:1. Even worse, if you take the busiest drive, the
>wear amplification is 12.23:1
>
>I re-ran this creating the zvol with volblocksize=4K and the results are in
>some ways worse.
>
>First, the linear fill speed filling the device went from 815MB/sec down to
>404 MB/sec. Here is the iostat capture 4K random write part of this test:
>
>Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz
>avgqu-sz await r_await w_await svctm %util
>sda 0.00 0.00 0.00 0.17 0.00 0.00
>4.80 0.00 0.00 0.00 0.00 0.00 0.00
>dm-0 0.00 0.00 0.00 0.17 0.00 0.00
>4.80 0.00 0.00 0.00 0.00 0.00 0.00
>dm-1 0.00 0.00 0.00 0.10 0.00 0.00
>8.00 0.00 0.00 0.00 0.00 0.00 0.00
>dm-2 0.00 0.00 0.00 0.00 0.00 0.00
>0.00 0.00 0.00 0.00 0.00 0.00 0.00
>dm-3 0.00 0.00 0.00 0.00 0.00 0.00
>0.00 0.00 0.00 0.00 0.00 0.00 0.00
>dm-4 0.00 0.00 0.00 0.00 0.00 0.00
>0.00 0.00 0.00 0.00 0.00 0.00 0.00
>sdc 0.00 0.00 319.03 1765.80 1.25 9.62
>10.67 0.44 0.21 0.19 0.21 0.10 19.89
>sdb 0.00 0.00 932.57 2888.70 3.64 15.83
>10.43 0.82 0.21 0.19 0.22 0.08 29.57
>sdd 0.00 0.00 937.03 2893.70 3.66 15.82
>10.42 0.81 0.21 0.19 0.22 0.08 29.77
>sde 0.00 0.00 305.40 1989.73 1.19 10.47
>10.41 0.48 0.21 0.19 0.21 0.09 19.80
>sdf 0.00 0.00 919.13 2894.97 3.59 15.82
>10.42 0.81 0.21 0.19 0.22 0.08 29.25
>sdh 0.00 0.00 928.33 2893.90 3.63 15.83
>10.42 0.82 0.21 0.19 0.22 0.08 29.51
>sdg 0.00 0.00 308.97 1773.57 1.21 9.60
>10.62 0.43 0.21 0.18 0.21 0.09 19.31
>sdi 0.00 0.00 311.17 1990.53 1.22 10.44
>10.37 0.49 0.21 0.19 0.22 0.09 20.29
>sdp 0.00 0.00 932.40 2791.30 3.64 15.83
>10.71 0.88 0.24 0.36 0.20 0.08 31.45
>sdq 0.00 0.00 303.73 2016.53 1.19 10.45
>10.27 0.49 0.21 0.26 0.20 0.09 20.23
>sdj 0.00 0.00 934.50 2814.23 3.65 15.82
>10.64 0.86 0.23 0.33 0.20 0.08 29.99
>sdk 0.00 0.00 316.53 1792.63 1.24 9.59
>10.51 0.43 0.20 0.26 0.19 0.09 19.83
>sdl 0.00 0.00 935.73 2861.43 3.66 15.81
>10.50 0.83 0.22 0.30 0.19 0.08 29.92
>sdm 0.00 0.00 306.20 2014.37 1.20 10.41
>10.25 0.49 0.21 0.27 0.20 0.09 20.28
>sdn 0.00 0.00 929.53 2825.83 3.63 15.82
>10.61 0.87 0.23 0.35 0.19 0.08 31.47
>sdo 0.00 0.00 310.53 1798.07 1.21 9.61
>10.51 0.42 0.20 0.24 0.20 0.09 19.85
>sdr 0.00 0.00 918.30 2887.80 3.59 15.82
>10.44 0.81 0.21 0.20 0.22 0.08 29.68
>sds 0.00 0.00 314.67 1773.30 1.23 9.61
>10.63 0.44 0.21 0.20 0.21 0.09 19.71
>sdu 0.00 0.00 307.03 1992.40 1.20 10.48
>10.40 0.50 0.22 0.18 0.22 0.09 20.44
>sdt 0.00 0.00 930.23 2894.17 3.63 15.85
>10.43 0.83 0.22 0.20 0.22 0.08 30.05
>sdy 0.00 0.00 310.40 1994.33 1.21 10.48
>10.39 0.49 0.21 0.18 0.22 0.09 19.88
>sdx 0.00 0.00 936.17 2887.07 3.66 15.85
>10.45 0.83 0.22 0.21 0.22 0.08 30.64
>sdw 0.00 0.00 314.67 1778.27 1.23 9.63
>10.63 0.44 0.21 0.21 0.21 0.09 19.77
>sdv 0.00 0.00 928.20 2899.53 3.63 15.86
>10.43 0.83 0.22 0.20 0.22 0.08 29.89
>zd0 0.00 0.00 2.77 9427.73 0.01 36.83
>8.00 6.90 0.73 2.31 0.73 0.07 70.43
>
>If you add up the numbers, the wear amplification is better at 8.43:1 or
>10.34:1 for the hottest drive, but this is still quite bad. Even worse,
>the write IOPS is now down to 9427 and 37 MB/sec. Remember that this array
>can write linearly at around 6000 MB/sec.
>
>I suspect my results are skewed by zvols. Then again, they probably match
>in-place updates inside of a single file. So if you are running a database
>you will probably see these numbers.
>
>ZFS has some amazing features. It just looks like it's design decisions
>were not made with SSDs in mind.
>
>I have posted the 'bm-flash' binary to:
>
>https://drive.google.com/file/d/0B3T4AZzjEGVkemkzWlhubmNmb0E/edit?usp=sharing
>
>I will post the source in a couple of days (if anyone is interested).
>
>Doug Dumitru
>EasyCo LLC
>
>To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org

Schlacta, Christ
2013-11-01 18:31:41 UTC
Permalink
If you must use raidz, I'd recommend 4x 4+2 or 8x 2+1 on 24 disks. 6x 3+1
might be suitable as well, but will suffer from some write-related overhead.
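
Spelled out for 24 disks, the 4x 4+2 option would be something like this (a sketch, using the same device names as your scripts):

zpool create -f -o ashift=12 data \
    raidz2 /dev/sd[b-g] raidz2 /dev/sd[h-m] \
    raidz2 /dev/sd[n-s] raidz2 /dev/sd[t-y]
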
On Nov 1, 2013 11:25 AM, "Gordan Bobic" <gordan.bobic-***@public.gmane.org> wrote:

> Err... 23-disk RAID5 is way beyond silly, both in terms of performance and
> reliability.
>
>
> Doug Dumitru <dougdumitruredirect-***@public.gmane.org> wrote:
>
> >I suspect the wear is different on SSDs than on HDDs because of scheduling
> >issues. SSDs tend to turn around IO requests faster than timer ticks so
> it
> >is easy to "chatter" a scheduler algorithm.
> >
> >In my test case, there array is quite large. I know that "mirroring" is
> >faster than raid-z1, but the goal with SSDs is to keep the costs as
> >reasonable as possible. In my case, I try to write code for SSDs that:
> >
> >* Keep the wear amplification as low as possible. 1.3:1 is good for
> >incompressible, undedupable data. Lower if data reduction can work.
> >* Keep the redundancy overhead as low as possible. 22+2 (raid-5 with a
> >hot-spare) for a 24 drive array is typical. Write perfect raid stripes.
> >Never let a read/modify/write raid operation happen.
> >* Run with consumer media if possible. With good wear this allows for 1
> >array overwrite per day with 5 year life.
> >
> >My ZFS tests are "all over the map". While I understand that mirrors are
> >faster, they should not be lower wear. You are after all writing to both
> >drives.
> >
> >Here is my latest, simple test:
> >
> >#!/bin/bash -x
> >
> >./stop-zfs.sh
> >
> >modprobe zfs zfs_arc_min=$((1024*1024*1024))
> zfs_arc_max=$((8192*1024*1024))
> >
> >zpool create data raidz1 /dev/sd[b-y] -f -o ashift=12
> >zfs create -V 300G data/test -o volblocksize=4K
> >zfs set logbias=throughput data/test
> >
> >./bm-flash --size=$((256*1024)) -wc --max=16384 /dev/data/test
> >
> >( sleep 5 ; iostat -mx 30 2 ) &
> >./bm-flash --size=$((256*1024)) -w --min=4096 --max=4096 --th1=10 --th2=0
> >--th3=0 --tm=40 /dev/data/test
> >
> >The output from the 30 second iostat snapshot is:
> >
> >Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz
> >avgqu-sz await r_await w_await svctm %util
> >sda 0.00 0.00 0.07 0.27 0.00 0.00
> >7.20 0.00 0.40 0.00 0.50 0.40 0.01
> >dm-0 0.00 0.00 0.00 0.27 0.00 0.00
> >6.00 0.00 0.50 0.00 0.50 0.50 0.01
> >dm-1 0.00 0.00 0.00 0.20 0.00 0.00
> >8.00 0.00 0.67 0.00 0.67 0.67 0.01
> >dm-2 0.00 0.00 0.00 0.00 0.00 0.00
> >0.00 0.00 0.00 0.00 0.00 0.00 0.00
> >dm-3 0.00 0.00 0.00 0.00 0.00 0.00
> >0.00 0.00 0.00 0.00 0.00 0.00 0.00
> >dm-4 0.00 0.00 0.00 0.00 0.00 0.00
> >0.00 0.00 0.00 0.00 0.00 0.00 0.00
> >sdc 0.00 0.00 706.97 3039.67 2.76 16.39
> >10.47 0.79 0.21 0.18 0.22 0.09 33.73
> >sdb 0.00 0.00 2046.90 4083.10 8.00 23.31
> >10.46 1.30 0.21 0.19 0.22 0.08 50.59
> >sdd 0.00 0.00 2089.07 4086.60 8.16 23.31
> >10.44 1.32 0.21 0.19 0.22 0.08 51.47
> >sde 0.00 0.00 675.73 1412.77 2.64 7.07
> >9.52 0.43 0.20 0.18 0.22 0.11 23.72
> >sdf 0.00 0.00 2065.53 4087.60 8.07 23.32
> >10.45 1.31 0.21 0.19 0.22 0.08 51.49
> >sdh 0.00 0.00 2094.37 4062.10 8.18 23.31
> >10.48 1.30 0.21 0.19 0.22 0.08 51.15
> >sdg 0.00 0.00 710.70 3042.20 2.78 16.39
> >10.46 0.79 0.21 0.18 0.22 0.09 33.40
> >sdi 0.00 0.00 669.63 1409.23 2.62 7.07
> >9.54 0.42 0.20 0.18 0.22 0.11 23.28
> >sdp 0.00 0.00 2096.43 3748.83 8.19 23.32
> >11.04 1.50 0.26 0.35 0.20 0.09 55.20
> >sdq 0.00 0.00 670.33 1405.67 2.62 7.07
> >9.56 0.42 0.20 0.19 0.21 0.11 22.35
> >sdj 0.00 0.00 2053.67 3771.50 8.02 23.32
> >11.02 1.46 0.25 0.34 0.20 0.09 54.75
> >sdk 0.00 0.00 713.17 3012.00 2.79 16.39
> >10.54 0.82 0.22 0.27 0.21 0.09 34.25
> >sdl 0.00 0.00 2094.93 3764.97 8.18 23.32
> >11.01 1.47 0.25 0.33 0.21 0.09 54.12
> >sdm 0.00 0.00 670.00 1405.60 2.62 7.07
> >9.56 0.41 0.20 0.17 0.21 0.11 22.36
> >sdn 0.00 0.00 2048.33 3793.67 8.00 23.32
> >10.98 1.45 0.25 0.34 0.20 0.09 54.21
> >sdo 0.00 0.00 713.07 3024.23 2.79 16.39
> >10.51 0.82 0.22 0.28 0.21 0.09 34.35
> >sdr 0.00 0.00 2044.40 4084.47 7.99 23.32
> >10.46 1.30 0.21 0.19 0.22 0.08 50.89
> >sds 0.00 0.00 702.43 3035.97 2.74 16.39
> >10.48 0.80 0.21 0.18 0.22 0.09 34.35
> >sdu 0.00 0.00 675.80 1410.17 2.64 7.07
> >9.53 0.42 0.20 0.18 0.22 0.11 23.76
> >sdt 0.00 0.00 2074.70 4071.50 8.10 23.31
> >10.47 1.33 0.22 0.19 0.23 0.08 51.84
> >sdy 0.00 0.00 669.27 1410.20 2.61 7.07
> >9.54 0.42 0.20 0.18 0.21 0.11 23.03
> >sdx 0.00 0.00 2085.13 4086.00 8.15 23.31
> >10.44 1.33 0.22 0.19 0.23 0.08 52.08
> >sdw 0.00 0.00 704.43 3056.30 2.75 16.39
> >10.42 0.79 0.21 0.18 0.22 0.09 34.13
> >sdv 0.00 0.00 2054.43 4087.33 8.03 23.31
> >10.45 1.33 0.22 0.20 0.23 0.08 52.15
> >zd0 0.00 0.00 0.00 11713.10 0.00 45.75
> >8.00 9.90 0.85 0.00 0.85 0.09 100.00
> >
> >The benchmark program is a C program that opens a block device with
> >O_DIRECT. All IO is aligned both on block and memory boundaries. The
> >"preconditioning" fills 256GB of the test volume with random data and then
> >does some writes at 512byte to 16Kbyte blocks at q=1 to q=40. Just a
> >repeatable set of IOs to get the device started. It should be noted that
> >the actual ZPOOL is 3TB so this 256GB is no where near full from a total
> >device viewpoint.
> >
> >sda and the dm-??? devices are the boot devices and their logical volumes.
> >sdb - sdy are the 24 SSDs used in the test. The 4K random write test
> >itself ran at 11,713 IOPS which is about 46MB/sec. You can see this on
> the
> >zd0 line from iostat. The zd0 device was 100% busy.
> >
> >The 24 SSDs were not 100% busy. Also, they did not see even amounts of
> >IO. If you add up with write MB/sec columns, you get 421 MB/sec for a
> wear
> >amplification of 9.19:1. Even worse, if you take the busiest drive, the
> >wear amplification is 12.23:1
> >
> >I re-ran this creating the zvol with volblocksize=4K and the results are
> in
> >some ways worse.
> >
> >First, the linear fill speed filling the device went from 815MB/sec down
> to
> >404 MB/sec. Here is the iostat capture 4K random write part of this test:
> >
> >Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz
> >avgqu-sz await r_await w_await svctm %util
> >sda 0.00 0.00 0.00 0.17 0.00 0.00
> >4.80 0.00 0.00 0.00 0.00 0.00 0.00
> >dm-0 0.00 0.00 0.00 0.17 0.00 0.00
> >4.80 0.00 0.00 0.00 0.00 0.00 0.00
> >dm-1 0.00 0.00 0.00 0.10 0.00 0.00
> >8.00 0.00 0.00 0.00 0.00 0.00 0.00
> >dm-2 0.00 0.00 0.00 0.00 0.00 0.00
> >0.00 0.00 0.00 0.00 0.00 0.00 0.00
> >dm-3 0.00 0.00 0.00 0.00 0.00 0.00
> >0.00 0.00 0.00 0.00 0.00 0.00 0.00
> >dm-4 0.00 0.00 0.00 0.00 0.00 0.00
> >0.00 0.00 0.00 0.00 0.00 0.00 0.00
> >sdc 0.00 0.00 319.03 1765.80 1.25 9.62
> >10.67 0.44 0.21 0.19 0.21 0.10 19.89
> >sdb 0.00 0.00 932.57 2888.70 3.64 15.83
> >10.43 0.82 0.21 0.19 0.22 0.08 29.57
> >sdd 0.00 0.00 937.03 2893.70 3.66 15.82
> >10.42 0.81 0.21 0.19 0.22 0.08 29.77
> >sde 0.00 0.00 305.40 1989.73 1.19 10.47
> >10.41 0.48 0.21 0.19 0.21 0.09 19.80
> >sdf 0.00 0.00 919.13 2894.97 3.59 15.82
> >10.42 0.81 0.21 0.19 0.22 0.08 29.25
> >sdh 0.00 0.00 928.33 2893.90 3.63 15.83
> >10.42 0.82 0.21 0.19 0.22 0.08 29.51
> >sdg 0.00 0.00 308.97 1773.57 1.21 9.60
> >10.62 0.43 0.21 0.18 0.21 0.09 19.31
> >sdi 0.00 0.00 311.17 1990.53 1.22 10.44
> >10.37 0.49 0.21 0.19 0.22 0.09 20.29
> >sdp 0.00 0.00 932.40 2791.30 3.64 15.83
> >10.71 0.88 0.24 0.36 0.20 0.08 31.45
> >sdq 0.00 0.00 303.73 2016.53 1.19 10.45
> >10.27 0.49 0.21 0.26 0.20 0.09 20.23
> >sdj 0.00 0.00 934.50 2814.23 3.65 15.82
> >10.64 0.86 0.23 0.33 0.20 0.08 29.99
> >sdk 0.00 0.00 316.53 1792.63 1.24 9.59
> >10.51 0.43 0.20 0.26 0.19 0.09 19.83
> >sdl 0.00 0.00 935.73 2861.43 3.66 15.81
> >10.50 0.83 0.22 0.30 0.19 0.08 29.92
> >sdm 0.00 0.00 306.20 2014.37 1.20 10.41
> >10.25 0.49 0.21 0.27 0.20 0.09 20.28
> >sdn 0.00 0.00 929.53 2825.83 3.63 15.82
> >10.61 0.87 0.23 0.35 0.19 0.08 31.47
> >sdo 0.00 0.00 310.53 1798.07 1.21 9.61
> >10.51 0.42 0.20 0.24 0.20 0.09 19.85
> >sdr 0.00 0.00 918.30 2887.80 3.59 15.82
> >10.44 0.81 0.21 0.20 0.22 0.08 29.68
> >sds 0.00 0.00 314.67 1773.30 1.23 9.61
> >10.63 0.44 0.21 0.20 0.21 0.09 19.71
> >sdu 0.00 0.00 307.03 1992.40 1.20 10.48
> >10.40 0.50 0.22 0.18 0.22 0.09 20.44
> >sdt 0.00 0.00 930.23 2894.17 3.63 15.85
> >10.43 0.83 0.22 0.20 0.22 0.08 30.05
> >sdy 0.00 0.00 310.40 1994.33 1.21 10.48
> >10.39 0.49 0.21 0.18 0.22 0.09 19.88
> >sdx 0.00 0.00 936.17 2887.07 3.66 15.85
> >10.45 0.83 0.22 0.21 0.22 0.08 30.64
> >sdw 0.00 0.00 314.67 1778.27 1.23 9.63
> >10.63 0.44 0.21 0.21 0.21 0.09 19.77
> >sdv 0.00 0.00 928.20 2899.53 3.63 15.86
> >10.43 0.83 0.22 0.20 0.22 0.08 29.89
> >zd0 0.00 0.00 2.77 9427.73 0.01 36.83
> >8.00 6.90 0.73 2.31 0.73 0.07 70.43
> >
> >If you add up the numbers, the wear amplification is better at 8.43:1 or
> >10.34:1 for the hottest drive, but this is still quite bad. Even worse,
> >the write IOPS is now down to 9427 and 37 MB/sec. Remember that this
> array
> >can write linearly at around 6000 MB/sec.
> >
> >I suspect my results are skewed by zvols. Then again, they probably match
> >in-place updates inside of a single file. So if you are running a
> database
> >you will probably see these numbers.
> >
> >ZFS has some amazing features. It just looks like it's design decisions
> >were not made with SSDs in mind.
> >
> >I have posted the 'bm-flash' binary to:
> >
> >
> https://drive.google.com/file/d/0B3T4AZzjEGVkemkzWlhubmNmb0E/edit?usp=sharing
> >
> >I will post the source in a couple of days (if anyone is interested).
> >
> >Doug Dumitru
> >EasyCo LLC
> >
> >To unsubscribe from this group and stop receiving emails from it, send an
> email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
>
> To unsubscribe from this group and stop receiving emails from it, send an
> email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
>

Doug Dumitru
2013-11-01 19:11:24 UTC
Permalink
On Friday, November 1, 2013 11:22:11 AM UTC-7, Gordan Bobic wrote:
>
> Err... 23-disk RAID5 is way beyond silly, both in terms of performance and
> reliability.


I understand the argument in terms of reliability. Then again, with SSDs,
the rebuild time is quite fast, so the multi-disk error window is lower.

In terms of performance it depends on the data patterns going to the
array. My applications all write perfect raid stripes and perfect long
blocks that match flash erase block boundaries. Thus write performance is
"wickedly good".


Doug Dumitru
EasyCo LLC


Schlacta, Christ
2013-11-01 19:27:34 UTC
Permalink
22*4k means your application always writes 88k at a time atomically?
Wait... flash erase is 8k. 22*8k is 176k. Even so, ZFS will break it up
into 128k plus a padded 64k, because of internal limitations of almost all
filesystems.

If you write 4k blocks on an ashift=12 pool, each write will be one data
block plus parity, plus metadata. With raidz2 that's at least
1 data + 2 parity + 2 metadata = 5 blocks before walking the Merkle tree.

You are seriously making a huge mistake writing 4k blocks to a raidz2 pool
with 22+2 disks.

Either write larger blocks to smaller vdevs, or accept massive
amplification.

4+2= 16k or 32k.
2+1= 8k or 16k.
8+2= 32k or 64k.

The only way 4k writes will ever be the right option for efficiency of
space and writes is on mirrored pairs.
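
In other words, a zvol on a 4+2 raidz2 vdev at ashift=12 wants a block that covers 4 data disks * 4k = 16k per stripe; a sketch of what I mean, per the sizes above:

zfs create -V 300G -o volblocksize=16K data/test
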
On Nov 1, 2013 12:11 PM, "Doug Dumitru" <dougdumitruredirect-***@public.gmane.org>
wrote:

>
>
> On Friday, November 1, 2013 11:22:11 AM UTC-7, Gordan Bobic wrote:
>>
>> Err... 23-disk RAID5 is way beyond silly, both in terms of performance
>> and reliability.
>
>
> I understand the argument in terms of reliability. Then again, with SSDs,
> the rebuild time is quite fast, so the multi-disk error window is lower.
>
> In terms of performance it depends on the data patterns going to the
> array. My applications all write perfect raid stripes and perfect long
> blocks that match flash erase block boundaries. Thus write performance is
> "wickedly good".
>
>
> Doug Dumitru
> EasyCo LLC
>
>
> To unsubscribe from this group and stop receiving emails from it, send an
> email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
>

Doug Dumitru
2013-11-01 19:37:36 UTC
Permalink
On Friday, November 1, 2013 12:27:34 PM UTC-7, Christ Schlacta wrote:
>
> 22*4k means your application always writes 88k at a time atomically?
> Wait.. flash erase is 8k. 22*8k is 176k. Even so, zfs well break it up
> into 128k, and 64k padded. Because internal limitations of almost all
> filesystems.
>
> If you write 4k blocks on an ashift=12 pool, each write will be one data
> block plus parity, plus metadata. With raid z2 that's at least
> 1data+2parity+2meta =5 before walking the merkle tree.
>
> You are seriously making a huge mistake writing 4k blocks to a raidz2 pool
> with 22+2 disks.
>
> Either write larger blocks to smaller vdevs, or accept massive
> amplification.
>
> 4+2= 16k or 32k.
> 2+1= 8k or 16k.
> 8+2= 32k or 64k.
>
> The only way 4k writes will ever be the right option for efficiency of
> space and writes is on mirrored pairs.
>
I am beginning to suspect that this is the case. I re-ran my tests as six
4-drive raid-z1 sets:

+ modprobe zfs zfs_arc_min=1073741824 zfs_arc_max=8589934592
+ zpool create data raidz1 /dev/sdb /dev/sdc /dev/sdd /dev/sde raidz1
/dev/sdf /dev/sdg /dev/sdh /dev/sdi raidz1 /dev/sdj /dev/sdk /dev/sdl
/dev/sdm raidz1 /dev/sdn /dev/sdo /dev/sdp /dev/sdq raidz1 /dev/sdr
/dev/sds /dev/sdt /dev/sdu raidz1 /dev/sdv /dev/sdw /dev/sdx /dev/sdy -f -o
ashift=12
+ zfs create -V 300G data/test
+ zfs set logbias=throughput data/test

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz
avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 727.07 475.00 2.84 2.67
9.38 0.21 0.17 0.17 0.17 0.11 12.87
sdb 0.00 0.00 2027.73 3524.83 7.92 24.63
12.01 0.95 0.17 0.19 0.16 0.07 39.67
sdd 0.00 0.00 2097.47 3502.10 8.19 24.63
12.01 0.96 0.17 0.20 0.16 0.07 40.45
sde 0.00 0.00 656.73 3119.67 2.57 22.22
13.44 0.60 0.16 0.18 0.16 0.06 21.56
sdf 0.00 0.00 2033.37 3909.43 7.94 24.22
11.08 1.04 0.18 0.20 0.17 0.07 40.85
sdh 0.00 0.00 2078.67 3912.10 8.12 24.21
11.05 1.05 0.18 0.20 0.17 0.07 41.43
sdg 0.00 0.00 714.50 1847.47 2.79 12.58
12.29 0.42 0.16 0.18 0.16 0.07 18.03
sdi 0.00 0.00 668.10 2116.60 2.61 11.87
10.65 0.48 0.17 0.19 0.17 0.07 18.28
sdp 0.00 0.00 2186.20 3769.73 8.54 24.19
11.25 1.09 0.18 0.26 0.14 0.07 43.72
sdq 0.00 0.00 578.50 1774.63 2.26 10.90
11.46 0.34 0.14 0.20 0.13 0.06 15.08
sdj 0.00 0.00 2109.27 3700.73 8.24 24.62
11.58 1.16 0.20 0.29 0.15 0.08 45.00
sdk 0.00 0.00 662.57 2697.27 2.59 16.40
11.57 0.52 0.16 0.22 0.14 0.06 19.48
sdl 0.00 0.00 2034.93 3705.43 7.95 24.62
11.62 1.16 0.20 0.30 0.15 0.08 44.92
sdm 0.00 0.00 737.17 1306.17 2.88 8.45
11.36 0.33 0.16 0.18 0.15 0.08 15.36
sdn 0.00 0.00 1971.43 3799.23 7.70 24.19
11.32 1.03 0.18 0.26 0.13 0.07 42.17
sdo 0.00 0.00 793.47 2177.10 3.10 13.52
11.46 0.45 0.15 0.18 0.14 0.06 18.63
sdr 0.00 0.00 1940.23 3429.63 7.58 24.75
12.33 0.95 0.18 0.19 0.17 0.07 38.72
sds 0.00 0.00 824.23 1510.57 3.22 10.76
12.26 0.39 0.17 0.18 0.16 0.08 18.28
sdu 0.00 0.00 553.53 1981.27 2.16 14.23
13.25 0.43 0.17 0.17 0.17 0.06 16.25
sdt 0.00 0.00 2210.87 3421.73 8.64 24.76
12.14 1.01 0.18 0.20 0.17 0.07 41.04
sdy 0.00 0.00 801.83 2338.87 3.13 15.23
11.97 0.55 0.17 0.19 0.17 0.07 21.67
sdx 0.00 0.00 1967.53 3915.73 7.69 24.52
11.21 1.05 0.18 0.21 0.17 0.07 40.96
sdw 0.00 0.00 579.40 1651.80 2.26 9.53
10.82 0.38 0.17 0.19 0.17 0.07 15.27
sdv 0.00 0.00 2190.80 3921.77 8.56 24.51
11.08 1.10 0.18 0.20 0.17 0.07 43.19
zd0 0.00 0.00 0.00 11710.23 0.00 45.74
8.00 9.91 0.85 0.00 0.85 0.09 100.00

The performance and wear are a little worse than a single big raid-z1 set.

I will run a mirrored set and post the results in a few minutes. Mirroring
on SSDs is problematic because you are trying to optimize $/GB.

Doug Dumitru
EasyCo LLC

> On Nov 1, 2013 12:11 PM, "Doug Dumitru" <dougdumit...-***@public.gmane.org<javascript:>>
> wrote:
>
>>
>>
>> On Friday, November 1, 2013 11:22:11 AM UTC-7, Gordan Bobic wrote:
>>>
>>> Err... 23-disk RAID5 is way beyond silly, both in terms of performance
>>> and reliability.
>>
>>
>> I understand the argument in terms of reliability. Then again, with
>> SSDs, the rebuild time is quite fast, so the multi-disk error window is
>> lower.
>>
>> In terms of performance it depends on the data patterns going to the
>> array. My applications all write perfect raid stripes and perfect long
>> blocks that match flash erase block boundaries. Thus write performance is
>> "wickedly good".
>>
>>
>> Doug Dumitru
>> EasyCo LLC
>>
>>
>> To unsubscribe from this group and stop receiving emails from it, send
>> an email to zfs-discuss...-VKpPRiiRko4/***@public.gmane.org <javascript:>.
>>
>

Doug Dumitru
2013-11-01 19:53:57 UTC
Permalink
Results for 12 2-drive mirrors:

+ modprobe zfs zfs_arc_min=1073741824 zfs_arc_max=8589934592
+ zpool create data mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde
mirror /dev/sdf /dev/sdh mirror /dev/sdg /dev/sdi mirror /dev/sdj /dev/sdk
mirror /dev/sdl /dev/sdm mirror /dev/sdn /dev/sdo mirror /dev/sdp /dev/sdq
mirror /dev/sdr /dev/sds mirror /dev/sdt /dev/sdu mirror /dev/sdv /dev/sdw
mirror /dev/sdx /dev/sdy -f -o ashift=12
+ zfs create -V 300G data/test
+ zfs set logbias=throughput data/test

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz
avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 735.57 1692.90 5.72 25.24
26.11 0.60 0.25 0.25 0.25 0.10 25.13
sdb 0.00 0.00 732.47 1692.87 5.69 25.24
26.12 0.57 0.24 0.20 0.25 0.09 21.89
sdd 0.00 0.00 734.57 1592.77 5.71 25.49
27.46 0.53 0.23 0.20 0.24 0.09 21.85
sde 0.00 0.00 731.43 1595.90 5.69 25.49
27.44 0.57 0.24 0.24 0.24 0.11 24.88
sdf 0.00 0.00 744.37 1564.40 5.79 25.80
28.02 0.55 0.24 0.20 0.26 0.10 22.09
sdh 0.00 0.00 736.67 1572.83 5.73 25.80
27.96 0.59 0.26 0.26 0.26 0.11 25.68
sdg 0.00 0.00 737.47 1401.60 5.73 25.60
30.00 0.51 0.24 0.20 0.26 0.10 21.44
sdi 0.00 0.00 738.13 1408.40 5.73 25.60
29.89 0.54 0.25 0.24 0.26 0.11 24.36
sdp 0.00 0.00 740.77 1461.00 5.76 25.35
28.94 0.56 0.25 0.32 0.22 0.11 25.29
sdq 0.00 0.00 734.57 1462.77 5.71 25.35
28.95 0.59 0.27 0.34 0.23 0.12 27.27
sdj 0.00 0.00 729.73 1621.53 5.68 25.25
26.94 0.55 0.23 0.28 0.21 0.10 24.24
sdk 0.00 0.00 728.80 1632.70 5.66 25.25
26.81 0.56 0.24 0.30 0.21 0.11 26.16
sdl 0.00 0.00 737.60 1684.70 5.73 25.58
26.48 0.55 0.23 0.27 0.21 0.10 24.27
sdm 0.00 0.00 741.47 1694.37 5.76 25.58
26.35 0.57 0.23 0.30 0.20 0.11 26.33
sdn 0.00 0.00 737.40 1659.43 5.74 25.74
26.89 0.52 0.22 0.26 0.20 0.10 24.20
sdo 0.00 0.00 738.47 1663.77 5.74 25.74
26.84 0.55 0.23 0.30 0.20 0.11 26.27
sdr 0.00 0.00 739.03 1277.57 5.75 25.61
31.84 0.51 0.25 0.20 0.28 0.11 21.52
sds 0.00 0.00 734.13 1282.90 5.71 25.61
31.80 0.55 0.27 0.24 0.29 0.12 24.52
sdu 0.00 0.00 738.47 1383.67 5.74 25.36
30.02 0.56 0.26 0.24 0.28 0.12 24.67
sdt 0.00 0.00 737.13 1366.57 5.73 25.36
30.27 0.53 0.25 0.20 0.28 0.10 21.77
sdy 0.00 0.00 738.63 1649.53 5.74 25.01
26.37 0.60 0.25 0.25 0.26 0.11 25.19
sdx 0.00 0.00 735.40 1648.43 5.71 25.01
26.39 0.57 0.24 0.21 0.25 0.09 22.35
sdw 0.00 0.00 738.40 1608.93 5.74 24.93
26.76 0.59 0.25 0.25 0.26 0.11 25.20
sdv 0.00 0.00 741.43 1599.47 5.77 24.93
26.86 0.56 0.24 0.21 0.26 0.10 22.69
zd0 0.00 0.00 0.00 12446.23 0.00 48.62
8.00 9.89 0.80 0.00 0.80 0.08 100.00

Wear amplification continues to get worse (12.5:1 here - roughly 25 MB/sec
of writes to each of the 24 drives, about 605 MB/sec at the device level,
against 48.6 MB/sec written to zd0). It just looks like 4K random in-place
writes to an existing volume (and probably to an existing file) are just
not what ZFS does well.

Doug Dumitru
EasyCo LLC

s***@public.gmane.org
2013-11-03 01:41:29 UTC
Permalink
did you set the volblocksize=4K with mirror test?



On Friday, November 1, 2013 8:53:57 PM UTC+1, Doug Dumitru wrote:
>
> Results for 12 2-drive mirrors:
>
> + modprobe zfs zfs_arc_min=1073741824 zfs_arc_max=8589934592
> + zpool create data mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde
> mirror /dev/sdf /dev/sdh mirror /dev/sdg /dev/sdi mirror /dev/sdj /dev/sdk
> mirror /dev/sdl /dev/sdm mirror /dev/sdn /dev/sdo mirror /dev/sdp /dev/sdq
> mirror /dev/sdr /dev/sds mirror /dev/sdt /dev/sdu mirror /dev/sdv /dev/sdw
> mirror /dev/sdx /dev/sdy -f -o ashift=12
> + zfs create -V 300G data/test
> + zfs set logbias=throughput data/test
>
>
> Wear Amplification continues to get worse (12.5:1 here). It just looks
> like 4K random writes in-place to an existing volume (and probably file) is
> just not what ZFS does well.
>
> Doug Dumitru
> EasyCo LLC
>

Gregor Kopka
2013-11-03 13:58:32 UTC
Permalink
On 03.11.2013 02:41, szabo2424-***@public.gmane.org wrote:
> did you set the volblocksize=4K with mirror test?
This.

> On Friday, November 1, 2013 8:53:57 PM UTC+1, Doug Dumitru wrote:
>
> Results for 12 2-drive mirrors:
>
> + modprobe zfs zfs_arc_min=1073741824 zfs_arc_max=8589934592
> + zpool create data mirror /dev/sdb /dev/sdc mirror /dev/sdd
> /dev/sde mirror /dev/sdf /dev/sdh mirror /dev/sdg /dev/sdi mirror
> /dev/sdj /dev/sdk mirror /dev/sdl /dev/sdm mirror /dev/sdn
> /dev/sdo mirror /dev/sdp /dev/sdq mirror /dev/sdr /dev/sds mirror
> /dev/sdt /dev/sdu mirror /dev/sdv /dev/sdw mirror /dev/sdx
> /dev/sdy -f -o ashift=12
> + zfs create -V 300G data/test
> + zfs set logbias=throughput data/test
>
> Wear Amplification continues to get worse (12.5:1 here). It just
> looks like 4K random writes in-place to an existing volume (and
> probably file) is just not what ZFS does well.
>

I suspect that your benchmark issues fsync on each request. Could you
try this with sync=disabled?
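
Something like this would do it (a one-liner; data/test is the zvol from
your earlier commands):

zfs set sync=disabled data/test
zfs set sync=standard data/test   # restore the default afterwards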

Gregor

Doug Dumitru
2013-11-03 19:10:17 UTC
Permalink
The test does I/O with O_DIRECT on the file, so there is no write
buffering. Writes are done at q=10, so there is overlap. This is how most
iSCSI volume exports
work, so these are the performance numbers that I need. Buffering writes
in RAM is not a test of a block device.
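
For anyone who wants to approximate the load: the test program is custom,
but a roughly equivalent run can be sketched with fio (assuming fio is
installed; /dev/zd0 is the zvol device seen in the iostat output):

fio --name=randwrite --filename=/dev/zd0 --ioengine=libaio \
    --direct=1 --rw=randwrite --bs=4k --iodepth=10 \
    --time_based --runtime=60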

With mirroring and volblocksize=4k, the linear speed of the volume was cut
in half. 4K random writes went from 14,394 IOPS to 15,492 IOPS and write
amplification went from 5.75:1 to 5.15:1, so this helps, but not a lot and
at a high cost for large IOs. I also suspect my ARC (RAM cache) efficiency
gets really bad.
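
(Note: volblocksize can only be set at zvol creation time, so a 4k zvol
has to be created roughly like this:)

zfs create -V 300G -o volblocksize=4k data/test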

Just for fun I ran a small 'postmark' benchmark. Again, I want to test the
device, so I had buffering turned off. This was ashift=12 striping all 24
drives (no raid, no mirroring, no safety, just speed).

PostMark v1.5 : 3/27/01
pm>pm>pm>pm>pm>pm>Current configuration is:
The base number of files is 50000
Transactions: 50000
Files range between 512 bytes and 500.00 kilobytes in size
Working directory:
/data/test (weight=1)
Block sizes are: read=512 bytes, write=512 bytes
Biases are: read/append=5, create/delete=5
Not using Unix buffered file I/O
Random number generator seed is 42
Report format is verbose.
pm>Creating files...Done
Performing transactions..........Done
Deleting files...Done
Time:
477 seconds total
292 seconds of transactions (171 per second)

Files:
75086 created (157 per second)
Creation alone: 50000 files (284 per second)
Mixed with transactions: 25086 files (85 per second)
25022 read (85 per second)
24978 appended (85 per second)
75086 deleted (157 per second)
Deletion alone: 50172 files (5574 per second)
Mixed with transactions: 24914 files (85 per second)

Data:
6773.66 megabytes read (14.20 megabytes per second)
21120.71 megabytes written (44.28 megabytes per second)

Comparing the 21120.71 "megabytes written" to the contents of
/proc/diskstats across the run, the wear amplification is about 1.37:1,
which is not too bad. So it looks like the "in place update" nature of
zvols is the issue. I have not tested this, but I suspect in-place updates
inside of a single large file will also be quite bad. So much for
databases.
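
(For reference, the device-level figure can be totaled from /proc/diskstats,
where field 10 is sectors written in 512-byte units, by sampling before and
after the run and subtracting; a minimal sketch for the 24 data drives:)

awk '$3 ~ /^sd[b-y]$/ { mb += $10 * 512 / 1048576 }
     END { printf "%.0f MB written\n", mb }' /proc/diskstats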

Just for comparisons, I ran the same postmark test on the 24 drives
configured raid-0 with an ext4 file system. The results are:

Deleting files...Done
Time:
48 seconds total
22 seconds of transactions (2272 per second)

Files:
75086 created (1564 per second)
Creation alone: 50000 files (2083 per second)
Mixed with transactions: 25086 files (1140 per second)
25022 read (1137 per second)
24978 appended (1135 per second)
75086 deleted (1564 per second)
Deletion alone: 50172 files (25086 per second)
Mixed with transactions: 24914 files (1132 per second)

Data:
6773.66 megabytes read (141.12 megabytes per second)
21120.71 megabytes written (440.01 megabytes per second)

Looking at /proc/diskstats, the total bytes written for md0 is 18,044 MB
(actually lower than postmark reports). This implies that, relative to what
ext4 actually wrote to the devices, the ZFS write amplification is about
1.61:1 (still not that bad).

It is more than a little interesting that ext4 in this case is 10x faster
than ZFS.

Doug Dumitru
EasyCo LLC

Gordan Bobic
2013-11-03 19:29:16 UTC
Permalink
On 11/03/2013 07:10 PM, Doug Dumitru wrote:
> The test does io with O_DIRECT on the file so no write buffering.
> Writes are done q=10, so there is overlap. This is how most iSCSI
> volume exports work, so these are the performance numbers that I need.
> Buffering writes in RAM is not a test of a block device.
>
> With mirroring and volblocksize=4k, the linear speed of the volume was
> cut in half. 4K random writes went from 14,394 IOPS to 15,492 IOPS and
> write amplification went from 5.75:1 to 5.15:1, so this helps, but not a
> lot and at a high cost for large IOs. I also suspect my RAM arc cache
> efficiency gets really bad.
>
> Just for fun I ran a small 'postmark' bencmark. Again, I want to test
> the device, so I had buffering turned off. This was ashift=12 striping
> all 24 drives (no raid, no mirroring, no safety just speed)
>
>
> Comparing the 21120.71 "megabytes written" to the contents of
> /proc/diskstats across the run, the wear amplification is about 1.37:1,
> which is not too bad. So it looks like the "in place update" nature of
> zvols is the issue. I have not tested this, but I suspect in-place
> updates inside of a single large file will also be quite bad. So much
> for databases.
>
> Just for comparisons, I ran the same postmark test on the 24 drives
> configured raid-0 with an ext4 file system. The results are:
>
> Deleting files...Done
> Time:
> 48 seconds total
> 22 seconds of transactions (2272 per second)
>
> Files:
> 75086 created (1564 per second)
> Creation alone: 50000 files (2083 per second)
> Mixed with transactions: 25086 files (1140 per second)
> 25022 read (1137 per second)
> 24978 appended (1135 per second)
> 75086 deleted (1564 per second)
> Deletion alone: 50172 files (25086 per second)
> Mixed with transactions: 24914 files (1132 per second)
>
> Data:
> 6773.66 megabytes read (141.12 megabytes per second)
> 21120.71 megabytes written (440.01 megabytes per second)
>
> Looking at /proc/diskstats, the total bytes written for md0 is 18,044 MB
> (actually lower than postmark reports). This implies that the write
> amplification for ZFS is actually 1.61:1 (still not that bad).
>
> It is more than a little interesting that ext4 in this case is 10x
> faster than ZFS.

Have you tried this with ext4 on LVM with a snapshot taken before you
ran the test? That would be more comparable.

Gordan

Doug Dumitru
2013-11-03 19:56:16 UTC
Permalink
> Have you tried this with ext4 on LVM with a snapshot taken before you
> ran the test? That would be more comparable.
>

Agreed on LVM. Whether a snapshot makes it more "comparable" is debatable;
after all, most LVM users don't have snapshots active all the time.

Regardless, I ran ext4 inside a 300GB logical volume with a 50GB snapshot
volume. The Postmark run-time increases from 48 to 210 seconds. A lot
slower, but still 2x faster than ZFS.
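
(Roughly the setup, for reference; the volume group name and mount point
are placeholders:)

lvcreate -L 300G -n test vg0
mkfs.ext4 /dev/vg0/test
lvcreate -s -L 50G -n test-snap /dev/vg0/test   # snapshot of the origin LV
mount /dev/vg0/test /mnt/test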

Without the snapshot, the run-time is 51 seconds.

My experience with LVM, at least in the simple cases, is that it is quite
low overhead. The only exception here is single threaded IO reads from
very fast SSDs. The latency is then the issue. Then again, ZFS in this
case is much worse than LVM. After all, when a drive turns around a read
in 100uS, it is hard to do "anything" to the IO or even insert an io
completion routine and not slow this down.

Doug Dumitru
EasyCo LLC


>
> Gordan
>
>

Doug Dumitru
2013-11-03 20:02:39 UTC
Permalink
... even with postmark buffering allowed, the ZFS test runs in 64 seconds.
ext4 thru LVM (no snapshot) is 20 seconds.

Doug Dumitru
EasyCo LLC

On Sunday, November 3, 2013 11:56:16 AM UTC-8, Doug Dumitru wrote:
>
>
> Have you tried this with ext4 on LVM with a snapshot taken before you
>> ran the test? That would be more comparable.
>>
>
> With LVM agreed. Whether there is a snapshot is debatable as to
> "comparable". After all most LVM users don't have snapshots active all the
> time.
>
> Regardless, ext4 inside of a 300GB logical volume with a 50GB snapshot
> volume. The Postmark run-time increases from 48 to 210 seconds. A lot
> slower but still 2x faster than zfs.
>
> Without the snapshot, the run-time is 51 seconds.
>
> My experience with LVM, at least in the simple cases, is that it is quite
> low overhead. The only exception here is single threaded IO reads from
> very fast SSDs. The latency is then the issue. Then again, ZFS in this
> case is much worse than LVM. After all, when a drive turns around a read
> in 100uS, it is hard to do "anything" to the IO or even insert an io
> completion routine and not slow this down.
>
> Doug Dumitru
> EasyCo LLC
>
>
>>
>> Gordan
>>
>>

Gregor Kopka
2013-11-03 21:57:26 UTC
Permalink
Doug,

> The test does io with O_DIRECT on the file so no write buffering.
> Writes are done q=10, so there is overlap. This is how most iSCSI
> volume exports work, so these are the performance numbers that I
> need. Buffering writes in RAM is not a test of a block device.
With the setup you have you get what you ask for: syncing every request
to disk, which is slow and inflates writes.
My comment was that if you want fewer writes to the drives, you should use
the features ZFS offers for this (ZIL).
Since your point was about wear amplification of various filesystems...

Gregor

Andrew Galloway
2013-11-03 22:54:54 UTC
Permalink
"The test does io with O_DIRECT on the file so no write buffering. Writes
are done q=10, so there is overlap. This is how most iSCSI volume exports
work, so these are the performance numbers that I need. Buffering writes
in RAM is not a test of a block device."

No, but you're benchmarking ZFS, and ZFS does exactly this - buffers writes
to RAM and then sequentially writes them down to disk every X seconds, even
when you specify O_DIRECT. Separate but related to that workflow is the
'ZIL', or ZFS Intent Log, which writes any sync writes (e.g. your O_DIRECT
requests) to stable storage immediately, before responding to the client -
not to their final location on the data drives, but to this 'intent log'.
In the absence of any 'log' devices in the pool, it opts to use a portion
of the space on the normal data drives to fulfill this requirement.

You've also shown a significant amount of confusion around block size,
parity use, and so on. You need to read up and be somewhat experienced with
how ZFS operates before running benchmarks on it, *especially* when you're
looking for such specific and detailed results.. or both your results and
your conclusions from them will be wrong.

- Andrew


On Sun, Nov 3, 2013 at 1:57 PM, Gregor Kopka <gregor-***@public.gmane.org> wrote:

> Doug,
>
>
> The test does io with O_DIRECT on the file so no write buffering. Writes
>> are done q=10, so there is overlap. This is how most iSCSI volume exports
>> work, so these are the performance numbers that I need. Buffering writes
>> in RAM is not a test of a block device.
>>
> With the setup you have you get what you ask for: syncing every request to
> disk, which is slow and inflates writes.
> My comment was that should you want less writes to the drives then you
> should use the features ZFS offers for this (ZIL).
> Since your point was about wear amplification of various filesystems...
>
> Gregor
>
>
>

Doug Dumitru
2013-11-04 01:46:10 UTC
Permalink
Andrew,

My tests without O_DIRECT are only marginally better. My understanding is
that O_DIRECT should still buffer inside of ZFS (unless there is a ZIL).

My overall take on ZFS is somewhat harsh when considering large pure SSD
arrays.

* ZFS has a hard time keeping up with, or even reaching, 1GB/sec of random
IO or 1.5GB/sec of linear IO.
* ZFS has wear amplification ranging from 2:1 for simple, no redundancy,
stripe sets, to over 20:1 with triple raid parity. This likely gets worse
for full pools.

If you think this is in error, then please let me know. I am not saying
that ZFS is "bad", it is just designed to address a different set of
problems in a different run-time environment.

Doug Dumitru
EasyCo LLC

Alex Narayan
2013-11-04 01:48:58 UTC
Permalink
What brought you to ZFS in the first place? It sounds like ext4 plus
bcache, or almost any other filesystem, might be a better fit.
On Nov 3, 2013 5:46 PM, "Doug Dumitru" <dougdumitruredirect-***@public.gmane.org>
wrote:

> Andrew,
>
> My tests without O_DIRECT are only marginally better. My understanding is
> that O_DIRECT should still buffer inside of ZFS (unless there is a ZIL).
>
> My overall take on ZFS is somewhat harsh when considering large puse SSD
> arrays.
>
> * ZFS has a hard time keeping up or even reaching 1GB/sec of random IO or
> 1.5GB/sec of linear IO.
> * ZFS has wear amplification ranging from 2:1 for simple, no redundancy,
> stripe sets, to over 20:1 with triple raid parity. This likely gets worse
> for full pools.
>
> If you think this is in error, then please let me know. I am not saying
> that ZFS is "bad", it is just designed to address a different set of
> problems in a different run-time environment.
>
> Doug Dumitru
> EasyCo LLC
>
>

Doug Dumitru
2013-11-04 02:01:07 UTC
Permalink
Alex,

I came to ZFS to learn. There is a "meme" that ZFS is "perfect" and the
"solution to everything". The documentation and commentary do not dispel
this. The only "performance limitation" I ever read about is that de-dupe
has memory issues. I have never seen any reference to SSD wear except in
terms of a ZIL log device.

So in the spirit of "don't assume, test", I decided to run some tests. I
was concerned about the numbers I was seeing, but did not want to push
forward without getting some expert verification. If my numbers are
misleading, then I need to be corrected. If they are not, then their
caveats need to become part of the FAQ. Nothing evil, just trying to see
what is real.

Doug Dumitru
EasyCo LLC

On Sunday, November 3, 2013 5:48:58 PM UTC-8, Alex Narayan wrote:
>
> What brought you to ZFS in the first place? It sounds like ext4 and bcache
> might be better or any of the filesystems.
> On Nov 3, 2013 5:46 PM, "Doug Dumitru" <dougdumit...-***@public.gmane.org>
> wrote:
>
>> Andrew,
>>
>> My tests without O_DIRECT are only marginally better. My understanding
>> is that O_DIRECT should still buffer inside of ZFS (unless there is a ZIL).
>>
>> My overall take on ZFS is somewhat harsh when considering large puse SSD
>> arrays.
>>
>> * ZFS has a hard time keeping up or even reaching 1GB/sec of random IO or
>> 1.5GB/sec of linear IO.
>> * ZFS has wear amplification ranging from 2:1 for simple, no redundancy,
>> stripe sets, to over 20:1 with triple raid parity. This likely gets worse
>> for full pools.
>>
>> If you think this is in error, then please let me know. I am not saying
>> that ZFS is "bad", it is just designed to address a different set of
>> problems in a different run-time environment.
>>
>> Doug Dumitru
>> EasyCo LLC
>>
>>
>

Alex Narayan
2013-11-04 02:54:23 UTC
Permalink
Doug,

I appreciate your research. I've learned things about ZFS that I didn't
know before.

I didn't mean to come off as demeaning. Would mirrored SSDs be way too
expensive? Perhaps a fork of ZFS could adjust the code to suit SSDs better.
On Nov 3, 2013 6:01 PM, "Doug Dumitru" <dougdumitruredirect-***@public.gmane.org>
wrote:

> Alex,
>
> I came to ZFS to learn. There is a "mime" that ZFS is "perfect" and the
> "solution to everything". The documentation and commentary do not dispel
> this. The only "performance limitation" I ever read about is that de-dupe
> has memory issues. I have never seen any reference to SSD wear except in
> terms of a ZIL log device.
>
> So in the spirit of "don't assume, test", I decided to run some tests. I
> was concerned about the numbers I was seeing, but did not want to push
> forward without getting some expert verification. If my numbers are
> misleading, then I need to be corrected. If they are not, then their
> caveats need to become part of the FAQ. Nothing evil, just trying to see
> what is real.
>
> Doug Dumitru
> EasyCo LLC
>
> On Sunday, November 3, 2013 5:48:58 PM UTC-8, Alex Narayan wrote:
>>
>> What brought you to ZFS in the first place? It sounds like ext4 and
>> bcache might be better or any of the filesystems.
>> On Nov 3, 2013 5:46 PM, "Doug Dumitru" <dougdumit...-***@public.gmane.org> wrote:
>>
>>> Andrew,
>>>
>>> My tests without O_DIRECT are only marginally better. My understanding
>>> is that O_DIRECT should still buffer inside of ZFS (unless there is a ZIL).
>>>
>>> My overall take on ZFS is somewhat harsh when considering large puse SSD
>>> arrays.
>>>
>>> * ZFS has a hard time keeping up or even reaching 1GB/sec of random IO
>>> or 1.5GB/sec of linear IO.
>>> * ZFS has wear amplification ranging from 2:1 for simple, no redundancy,
>>> stripe sets, to over 20:1 with triple raid parity. This likely gets worse
>>> for full pools.
>>>
>>> If you think this is in error, then please let me know. I am not saying
>>> that ZFS is "bad", it is just designed to address a different set of
>>> problems in a different run-time environment.
>>>
>>> Doug Dumitru
>>> EasyCo LLC
>>>
>

Doug Dumitru
2013-11-04 04:08:27 UTC
Permalink
On Sunday, November 3, 2013 6:54:23 PM UTC-8, Alex Narayan wrote:
>
> Doug,
>
> I appreciate your research. I've learned things about ZFS that I didn't
> know before.
>
> I didn't mean to come off as demeaning. Would mirrored SSDs be way too
> expensive ? Perhaps a fork of ZFS to adjust the code to suit SSDs better
>
I have users who think SSDs are just fine, and others who think they are
outrageous. The "holy grail" for SSDs is to:

* use cheap, low endurance drives
* use as few "extra" cells as possible for redundancy
* use data reduction to reduce space usage even more
* be 100% reliable and last forever.

For some vendors, this means 20nm, three-bit-per-cell drives, with custom
controllers, analog error processing, adaptive write algorithms that slow
down with drive wear (and things like temperature), multi-level error
detection, de-dupe, compression, another layer of error detection, another
layer of block-level error correction, thin provisioning, zero copy clones,
and a bunch of other stuff.

No, this is not open-source code, but it does exist. The trick is to do
all of this "AND" keep it fast.

Doug Dumitru
EasyCo LLC

Marcus Sorensen
2013-11-04 02:38:54 UTC
Permalink
I've run all-ssd arrays of up to 20 drives and not gotten results as
horrible as yours. I don't have enough time to dig into what the difference
is, but I can say I've written 200GB sequentially at 2.5GB/s (20 10GB files
simultaneously), then exported/imported the pool to clear arc and read them
back at 5GB/s. Exporting/importing and then running random reads on the
files yielded ~600k iops.

I do still think that you need to increase your zvol block size; you
shouldn't be getting 5:1 writes unless you are fixing the record size to be
close to your ashift. You want your record size to be number of data disks
* ashift for raidzX. It may cause you to read a bit more if you always only
have small random io, but it shouldn't make too much of a difference. Arc
will help, but not with a benchmark designed to avoid it. Also, direct and
sync are two different things. As far as I remember zfs Linux doesn't even
support directio, but you can still issue syncs to ruin your io as much as
you want.

In the end each filesystem is a tradeoff. I recently saw that checksum
support was put into the kernel for xfs, which is good but it won't be able
to correct errors. I think other filesystem's might give me better numbers
in some cases, but I need the features.
On Nov 3, 2013 6:46 PM, "Doug Dumitru" <dougdumitruredirect-***@public.gmane.org>
wrote:

> Andrew,
>
> My tests without O_DIRECT are only marginally better. My understanding is
> that O_DIRECT should still buffer inside of ZFS (unless there is a ZIL).
>
> My overall take on ZFS is somewhat harsh when considering large puse SSD
> arrays.
>
> * ZFS has a hard time keeping up or even reaching 1GB/sec of random IO or
> 1.5GB/sec of linear IO.
> * ZFS has wear amplification ranging from 2:1 for simple, no redundancy,
> stripe sets, to over 20:1 with triple raid parity. This likely gets worse
> for full pools.
>
> If you think this is in error, then please let me know. I am not saying
> that ZFS is "bad", it is just designed to address a different set of
> problems in a different run-time environment.
>
> Doug Dumitru
> EasyCo LLC
>
>

Marcus Sorensen
2013-11-04 03:32:46 UTC
Permalink
Heh, I just reread the OP and realized I had forgotten the original issue.
I think part of what you're seeing on amplification is block/record size
related, as mentioned, combined with a benchmark undermining zfs
transaction groups by artificially calling sync a ton. More basic
filesystems will perform better and write less under those circumstances,
but most real apps don't act that way, either.
On Nov 3, 2013 7:38 PM, "Marcus Sorensen" <shadowsor-***@public.gmane.org> wrote:

> I've run all-ssd arrays of up to 20 drives and not gotten results as
> horrible as yours. I don't have enough time to dig into what the difference
> is, but I can say I've written 200GB sequentially at 2.5GB/s (20 10GB files
> simultaneously), then exported/imported the pool to clear arc and read them
> back at 5GB/s. Exporting/importing and then running random reads on the
> files yielded ~600k iops.
>
> I do still think that you need to increase your zvol block size, you
> shouldn't be getting 5:1 writes unless you are fixing the record size to be
> close to your ashift. You want your record size to be number of data disks
> * ashift for raidzX. It may cause you to read a bit more if you always only
> have small random io, but it shouldn't make too much of a difference. Arc
> will help, but not with a benchmark designed to avoid it. Also, direct and
> sync are two different things. As far as I remember zfs Linux doesn't even
> support directio, but you can still issue syncs to ruin your io as much as
> you want.
>
> In the end each filesystem is a tradeoff. I recently saw that checksum
> support was put into the kernel for xfs, which is good but it won't be able
> to correct errors. I think other filesystem's might give me better numbers
> in some cases, but I need the features.
> On Nov 3, 2013 6:46 PM, "Doug Dumitru" <dougdumitruredirect-***@public.gmane.org>
> wrote:
>
>> Andrew,
>>
>> My tests without O_DIRECT are only marginally better. My understanding
>> is that O_DIRECT should still buffer inside of ZFS (unless there is a ZIL).
>>
>> My overall take on ZFS is somewhat harsh when considering large puse SSD
>> arrays.
>>
>> * ZFS has a hard time keeping up or even reaching 1GB/sec of random IO or
>> 1.5GB/sec of linear IO.
>> * ZFS has wear amplification ranging from 2:1 for simple, no redundancy,
>> stripe sets, to over 20:1 with triple raid parity. This likely gets worse
>> for full pools.
>>
>> If you think this is in error, then please let me know. I am not saying
>> that ZFS is "bad", it is just designed to address a different set of
>> problems in a different run-time environment.
>>
>> Doug Dumitru
>> EasyCo LLC
>>
>>
>

Doug Dumitru
2013-11-04 04:11:49 UTC
Permalink
On Sunday, November 3, 2013 7:32:46 PM UTC-8, Marcus Sorensen wrote:
>
> Heh, I just reread the OP and realized I had forgotten the original issue.
> I think part of what you're seeing on amplification is block/record size
> related, as mentioned, combined with a benchmark undermining zfs
> transaction groups by artifically calling sync a ton. More basic
> filesystems will perform better and write less under those circumstances,
> but most real apps don't act that way, either.
>

I know my test is artificial, but it actually is similar to what big VDI
farms generate doing boot storms.

I suspect that "normal" file IO has this issue as well. I ran a
side-by-side postmark test, comparing bytes written on ZFS to bytes
written on ext4. This still shows 2:1 more writes with ZFS running striped
(no mirror, no raid). While this is not a lot, it is still an important
system design parameter.

Doug Dumitru
EasyCO LLC

Uncle Stoatwarbler
2013-11-04 09:46:10 UTC
Permalink
On 04/11/13 04:11, Doug Dumitru wrote:

> I know my test is artificial, but it actually is similar to what big VDI
> farms generate doing boot storms.

How often do those happen?


Gordan Bobic
2013-11-04 10:01:52 UTC
Permalink
I would have thought that boot storms are read intensive, rather than write
intensive.


On Mon, Nov 4, 2013 at 9:46 AM, Uncle Stoatwarbler <stoatwblr-***@public.gmane.org>wrote:

> On 04/11/13 04:11, Doug Dumitru wrote:
>
> I know my test is artificial, but it actually is similar to what big VDI
>> farms generate doing boot storms.
>>
>
> How often do those happen?
>
>
>

Uncle Stoatwarbler
2013-11-04 10:15:26 UTC
Permalink
On 04/11/13 10:01, Gordan Bobic wrote:
> I would have thought that boot storms are read intensive, rather than
> write intensive.
>

No matter what they are, they're usually rare.

There's not a lot of point in trying to overly optimize for corner cases
if it hurts overall performance, unless _not_ optimizing results in
extending the storms unduly.

I understand the worries about write amplification - I've been following
this thread with some interest because a ZFS SSD array is being proposed
for networked /home at $orkplace - but if ZFS is allowed to do its thing
(and that includes using ZIL even if it's counterintuitive for SSD
arrays(*)) then I'd like to see the results.

The tests being run by the OP seem designed to hobble ZFS's write
optimisations and then see how hard it can be driven with suboptimal
data streams. In such a case I'm not surprised that write amplification
is a major issue.


(*) SSDs do VERY badly at simultaneous R/W in general.


Gregor Kopka
2013-11-04 09:06:47 UTC
Permalink
On 04.11.2013 10:46, Uncle Stoatwarbler wrote:
> On 04/11/13 04:11, Doug Dumitru wrote:
>
>> I know my test is artificial, but it actually is similar to what big VDI
>> farms generate doing boot storms.
>
> How often do those happen?

And shouldn't most of the images be clones (with low variations for
configuration) anyway, so the majority of data being pulled from disk is
requested by the first client and afterwards similar clients (except
/etc) will be served from ARC?

Gregor

Uncle Stoatwarbler
2013-11-04 10:18:09 UTC
Permalink
On 04/11/13 09:06, Gregor Kopka wrote:

> And shouldn't most of the images be clones (with low variations for
> configuration) anyway, so the majority of data being pulled from disk is
> requested by the first client and afterwards similar clients (except
> /etc) will be served from ARC?

The OP has no ARC or ZIL - theoretically in an all SSD array they're not
needed.

Personally I feel that a high speed ZIL drive is a minimum requirement,
especially if consumer-grade SSDs are being used for the main store - if
for no other reason than coalescing writes.





Gregor Kopka
2013-11-04 09:20:49 UTC
Permalink
On 04.11.2013 11:18, Uncle Stoatwarbler wrote:
> On 04/11/13 09:06, Gregor Kopka wrote:
>
>> And shouldn't most of the images be clones (with low variations for
>> configuration) anyway, so the majority of data being pulled from disk is
>> requested by the first client and afterwards similar clients (except
>> /etc) will be served from ARC?
>
> The OP has no ARC or ZIL - theoretically in an all SSD array they're
> not needed.
I know. My (more or less rhetorical) question was with a sane setup in
mind, not for an ill-tuned synthetic benchmark.

Gregor

Uncle Stoatwarbler
2013-11-04 12:25:12 UTC
Permalink
On 04/11/13 09:20, Gregor Kopka wrote:

> I know. My (more or less rethoric) question was with a sane setup in
> mind, not for an ill tuned synthetic benchmark.

The problems with rhetorical questions are many, but for today I'll point
out:

1: Many times they aren't obvious to native language speakers

2: If someone is obsessive-compulsively following a benchmark, the
importance of the question may not be obvious.


Personally I approach all benchmarks with a lot of suspicion because
except in a few corner cases they often don't reflect real world loads.

(Especially in areas such as general .edu fileservers where the load is
a mixture of everything you can think of and then some.)

Gordan Bobic
2013-11-04 12:30:37 UTC
Permalink
On Mon, Nov 4, 2013 at 12:25 PM, Uncle Stoatwarbler <stoatwblr-***@public.gmane.org>wrote:

>
> Personally I approach all benchmarks with a lot of suspicion because
> except in a few corner cases they often don't reflect real world loads.
>

This is probably the most important point on this thread so far.
1) Don't trust benchmarks unless you are running them yourself - and even
then don't assume it measures exactly what you think it does.
2) Put the system under a real load, not one you guess might approximate it
via some kind of a synthetic benchmark app.

For any other approach, you might as well roll dice and say that's what
your benchmark numbers are, as far as meaningfulness is concerned.

Ryan How
2013-11-04 11:40:26 UTC
Permalink
Would be interesting to see the effect of the ZIL in this situation.

On 4/11/2013 6:18 PM, Uncle Stoatwarbler wrote:
> Personally I feel that a high speed ZIL drive is a minimum
> requirement, especially if consumer-grade SSDs are being used for the
> main store - if for no other reason than coalescing writes.
>
>
>
>
>


Gordan Bobic
2013-11-04 11:50:42 UTC
Permalink
On Mon, Nov 4, 2013 at 10:18 AM, Uncle Stoatwarbler <stoatwblr-***@public.gmane.org>wrote:

>
> Personally I feel that a high speed ZIL drive is a minimum requirement,
> especially if consumer-grade SSDs are being used for the main store - if
> for no other reason than coalescing writes.
>

I'm not entirely sure it would actually help. If your writes are mostly
4KB, but backing storage has default variable block size the multiple
writes _might_ get coalesced into 128KB writes due to ZIL. So far so good.
But if you are getting rewrites/modifications, as you do due to CoW, you
end up having to rewrite the whole block for a much smaller write, so you
still end up with the write-amplification (unless I'm misunderstanding how
this works).

On spinning rust this is a non-issue because a 128KB linear write will have
negligible performance penalty over a 4KB linear write, so it's a decent
tradeoff. With any kind of solid-state storage, however, this becomes very
expensive very quickly.

The only sensible solution I can see is to always match the block size to
the typical write size - in which case a ZIL isn't likely to help you,
unless you are running on very cheap, poorly optimized SSD on which linear
writes are massively faster than random writes (i.e. as is the case with
typical USB sticks or SD/CF cards).

Gregor Kopka
2013-11-04 12:35:58 UTC
Permalink
On 04.11.2013 12:50, Gordan Bobic wrote:
> On Mon, Nov 4, 2013 at 10:18 AM, Uncle Stoatwarbler
> <stoatwblr-***@public.gmane.org <mailto:stoatwblr-***@public.gmane.org>> wrote:
>
>
> Personally I feel that a high speed ZIL drive is a minimum
> requirement, especially if consumer-grade SSDs are being used for
> the main store - if for no other reason than coalescing writes.
>
>
> I'm not entirely sure it would actually help. If your writes are
> mostly 4KB, but backing storage has default variable block size the
> multiple writes _might_ get coalesced into 128KB writes due to ZIL. So
> far so good. But if you are getting rewrites/modifications, as you do
> due to CoW, you end up having to rewrite the whole block for a much
> smaller write, so you still end up with the write-amplification
> (unless I'm misunderstanding how this works).
>
> On spinning rust this is a non-issue because a 128KB linear write will
> have negligible performance penalty over a 4KB linear write, so it's a
> decent tradeoff. With any kind of solid-state storage, however, this
> becomes very expensive very quickly.
>
> The only sensible solution I can see is to always match the block size
> to the typical write size - in which case a ZIL isn't likely to help
> you, unless you are running on very cheap, poorly optimized SSD on
> which linear writes are massively faster than random writes (i.e. as
> is the case with typical USB sticks or SD/CF cards).

My gut feeling is that ZIL would reduce writes to higher-level metadata
(which would otherwise need to be updated on each data block write when
benchmarking with fsync) - since the main point of the OP was wear
amplification, not performance...

Gregor

Chris Siebenmann
2013-11-04 15:43:26 UTC
Permalink
| My gut feeling is that ZIL would reduce writes to higher-level metadata
| (which would else need to be updated on each data block write when
| benchmarking with fsync) - since the main point of the OP was wear
| amplification, not performance...

I think that people may have a mis-perception of the ZIL and how it
works. ZFS pools *always* have a ZIL; the only question is whether it is
internal (written to regular data vdevs) or external on a separate log
device. As far as I know you'll get the same write amplification almost
regardless of where the ZIL lives, it's just a question of which device
sees it.

(There are a whole bunch of tangled issues here that will require
performance tuning in a real pool, especially if your goal is to
separate out fsync()-forced writes from regular pool IO. The current
SLOG code is optimized for (theoretical) low latency, not write
diversion.)

- cks

Gregor Kopka
2013-11-04 17:28:08 UTC
Permalink
On 04.11.2013 16:43, Chris Siebenmann wrote:
> | My gut feeling is that ZIL would reduce writes to higher-level metadata
> | (which would else need to be updated on each data block write when
> | benchmarking with fsync) - since the main point of the OP was wear
> | amplification, not performance...
>
> I think that people may have a mis-perception of the ZIL and how it
> works. ZFS pools *always* have a ZIL; the only question is whether it is
> internal (written to regular data vdevs) or external on a separate log
> device. As far as I know you'll get the same write amplification almost
> regardless of where the ZIL lives, it's just a question of which device
> sees it.
>
> (There are a whole bunch of tangled issues here that will require
> performance tuning in a real pool, especially if your goal is to
> separate out fsync()-forced writes from regular pool IO. The current
> SLOG code is optimized for (theoretical) low latency, not write
> diversion.)

I know that there is always a ZIL; my point was to add a slog so that the
sync writes to it (which will never be read except for disaster recovery
after an unclean shutdown) won't hit the precious data disks, so the wear
amplification on them goes down - even at the price of having to dump a
slog drive from time to time.
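
(Concretely, something along these lines; the device names are
placeholders:)

zpool add data log mirror /dev/sdz /dev/sdaa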

Gregor

Uncle Stoatwarbler
2013-11-04 17:39:07 UTC
Permalink
On 04/11/13 15:43, Chris Siebenmann wrote:


> I think that people may have a mis-perception of the ZIL and how it
> works. ZFS pools *always* have a ZIL; the only question is whether it is
> internal (written to regular data vdevs) or external on a separate log
> device.

I'm aware of that.

For those not following this: if there is no ZIL device, ZIL data is
striped across the main vdev(s), then written to the main filesystem and
finally the ZIL stripe is removed, ensuring atomic data integrity but
resulting in at least a 2:1 write amplification.

If you want to compare apples to apples on an ext4 filesystem, then it
must be mounted using "data=journal", otherwise the ext4 system only
journals metadata (data=ordered) and is susceptible to data loss in the
event of a power failure. The ext4 FS should also be mounted with the
journal_checksum parameter (which does what it says - creates and writes
journal checksums for added corruption resistance).
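
For example (device and mount point are placeholders):

mount -o data=journal,journal_checksum /dev/vg0/test /mnt/test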


> As far as I know you'll get the same write amplification almost
> regardless of where the ZIL lives, it's just a question of which device
> sees it.

Agreed - no matter where the ZIL is located, you will see at _least_ 2:1
write amplification, but if you cripple ZFS layout it can end up a lot
worse.


The reasons for preferring a small, fast, possibly SLC, dedicated ZIL
device (or mirror if you're paranoid) are:

1: It is a lot easier to replace the ZIL than a bunch of larger drives
if it wears out (it can even be done on a running system!)

2: Because of the way the dedicated ZIL works, you're sending it sequential
writes, not random ones (longer life)

3: Assuming the device is not ridiculously small, it will coalesce most
of the 4kb writes into something meaningful (and mostly sequential) for
the main vdev, saving write-amplification issues there as well as
avoiding the IOPS degradation that a direct random RW scenario causes.

4: If zfs sync=always, as long as the ZIL device sequential write speed
is adequate it will keep up with writes as they happen. No need for
O_DIRECT games.

and

5: It allows you to use cheaper, slower SSDs for the main vdev(s) than
if you try to achieve the same IO loading with direct writes/reads. For
this scenario the slowish vdev SSDs can be bracketed by high-performance
cache/zil devices (assuming CPU/RAM/Network can keep up too)


As always: YMMV.


It's worth noting that every single vendor pitching ZFS at my dayjob is
specifying 8GB STEC ZeusRAM drives for ZIL. These things are
hellaciously expensive but they're pretty much guaranteed to keep up.


Andrew Galloway
2013-11-04 19:16:24 UTC
Permalink
I'm guilty of it sometimes myself, but can we all agree not to
interchangeably use the 'ZIL' acronym to stand for the ZFS Intent Log /and/
for log (v)devs? Log/slog devices are an optional add-on to a pool; they
take over the log write duty from the data vdevs for ZIL mechanics. Unless
you specifically disable it, the ZIL is always there, and is utilizing your
normal data drives if not given log devices.

And on the topic of that, the ZIL mechanics in how it handles pushing down
I/O is even worse than has so far been mentioned. Every time ZFS writes
data down as part of a ZIL write, it by default is going to also send a
CACHE SYNC command to the disk(s) used for the log. Usually the only time a
data disk sees a CACHE FLUSH command from ZFS is during txg commits, I
believe, but if they're pulling double duty for log traffic, suddenly
they're seeing them constantly. Only truly enterprise-grade SSD's with
non-volatile caches are unaffected by CACHE SYNC commands. They are
unaffected because they basically no-op (ignore) it, as they know they're
safe. SSD's with volatile caches should, by spec, obey this command, and
AFAIK it has an adverse effect on both their performance and their
longevity to be handed these basically with every write coming down the
pipe, which is much of what they'll see if told to act as a log device.

This is why stoatwblr's vendors are pushing ZeusRAM's at him - if they're
not, they'll push ZeusIOPS or possibly SMART Optimus at him. Only devices
that /safely/ no-op cache sync commands should be used for ZIL. And on the
topic of ZeusRAM's, yes, they're hellaciously expensive, and yes, they're
the very best device on the planet for ZFS log use (though they do have
other drawbacks beyond their cost -- they're also small; at least for now,
the log devices need to have space for approximately 3 txg's worth of data
on them at a time.. so if your txg_timeout is 10s, they need to be able to
hold 30s of data -- if your total ingress of log-utilizing writes in 30s is
> 7.5 GB or so, ZFS ends up having to force txg's early, which engages the
zfs write throttle mechanics, which almost always puts you in a place
you'll hate being in -- so you end up having to buy as many as it takes to
hold 30s worth of log capacity, at multiple 1000's of dollars per device;
that said, if performance is your goal, it is money /well spent/).


On Mon, Nov 4, 2013 at 9:39 AM, Uncle Stoatwarbler <stoatwblr-***@public.gmane.org>wrote:

> On 04/11/13 15:43, Chris Siebenmann wrote:
>
>
> I think that people may have a mis-perception of the ZIL and how it
>> works. ZFS pools *always* have a ZIL; the only question is whether it is
>> internal (written to regular data vdevs) or external on a separate log
>> device.
>>
>
> I'm aware of that.
>
> For those not following this: if there is no ZIL device, ZIL data is
> striped across the main vdev(s), then written to the main filesystem and
> finally the ZIL stripe is removed, ensuring atomic data integrity but
> resulting in at at least a 2:1 write amplification.
>
> If you want to compare apples to apples on an ext4 filesystem, then it
> must be mounted using "data=journal", otherwise the ext4 system only
> journals metadata (data=ordered) and is susceptable to data loss in the
> event of a power failure. The ext4 FS should also be mounted with the
> journal_checksum parameter (which does what it says - creates and writes
> journal checksums for added corruption resistance)
>
>
>
> As far as I know you'll get the same write amplification almost
>> regardless of where the ZIL lives, it's just a question of which device
>> sees it.
>>
>
> Agreed - no matter where the ZIL is located, you will see at _least_ 2:1
> write amplification, but if you cripple ZFS layout it can end up a lot
> worse.
>
>
> The reasons for preferring a small, fast, possibly SLC, dedicated ZIL
> device (or mirror if you're paranoid) are:
>
> 1: It is a lot easier to replace the ZIL than a bunch of larger drives if
> it wears out (it can even be done on a running system!)
>
> 2: because of way the dedicated ZIL works, you're sending it sequential
> writes, not random ones (longer life)
>
> 3: Assuming the device is not ridiculously small, it will coalesce most of
> the 4kb writes into something meaningful (and mostly sequential) for the
> main vdev, saving write-amplification issues there as well as avoiding the
> IOPS degradation that a direct random RW scenario causes.
>
> 4: If zfs sync=always, as long as the ZIL device sequential write speed is
> adequate it will keeping up with writes as they happen. No need for
> o_direct games
>
> and
>
> 5: It allows you to use cheaper, slower SSDs for the main vdev(s) than if
> you try to achieve the same IO loading with direct writes/reads. For this
> scenario the slowish vdev SSDs can be bracketed by high-performance
> cache/zil devices (assuming CPU/RAM/Network can keep up too)
>
>
> As always: YMMV.
>
>
> It's worth noting that every single vendor pitching ZFS at my dayjob is
> specifying 8Gb STEC ZeusRam drives for ZIL. These things are hellaciously
> expensive but they're pretty much guaranteed to keep up.
>
>
>
>

Uncle Stoatwarbler
2013-11-04 12:48:19 UTC
Permalink
On 04/11/13 11:50, Gordan Bobic wrote:
> unless you are running on very cheap, poorly optimized SSD on which
> linear writes are massively faster than random writes (i.e. as is the
> case with typical USB sticks or SD/CF cards).

If you sit down and thrash the heck out of SSDs you'll find that the
issues become:

1: Most SSDs still don't handle simultaneous reads and writes very
well(*)(**) - you can pretty much halve (or more) the worst case IOPS
published for a drive if you interleave reads and writes. It's best to
use decent sized chunks for access if you can and that's where a ZIL wins.


2: Even with all the optimisations being shoehorned into SSDs (eg
Samsung 840 EVOs carving off an SLC section, others using RAIN
internally), random writes are still slower than linear writes, even if
not "massively slower".

Having said that, throwing zillions of o_direct random 4k writes at an
array isn't going to work very well at all (for fairness the OP should
be comparing raid6+lvm+ext4 with options data=journal and
journal_checksum enabled to his ZFS setup). A ZIL drive should mitigate
things a little, but ZFS simply isn't designed for that kind of workload
and having the wrong ashift for the vdev will make things much worse.

With that kind of load on ZFS - SSDs or not - one needs to take a leaf
from Oracle's playbook when they put ZFS behind one of their databases
(lots of RAM, large cache, large ZIL and no games trying to force
O_DIRECT). The ZIL drive is there to catch sync writes; let it do what
it's designed for.
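
For reference, a minimal sketch of that kind of setup (the pool name and
device paths below are placeholders, not anything from this thread):

# add a small mirrored SLOG and push everything through the ZIL
zpool add tank log mirror /dev/disk/by-id/ata-SLC_SSD_A /dev/disk/by-id/ata-SLC_SSD_B
zfs set sync=always tank/vol       # every write is logged before it is acknowledged
zfs set logbias=latency tank/vol   # the default; small sync writes land on the SLOG

The data still gets written twice, but with a dedicated log device the
double write lands on the SLOG instead of the data vdevs, which is the
point being made above.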

With ZFS badly hobbled in the described setup, it's a wonder it performs
as well as it does. I don't care if it's generating 25:1 write
amplification to the drive as long as this is only happening for brief
periods.


(*) This is one of my pet peeves about virtually all published HDD/SSD
benchmarks - they only show "all writes" or "all reads", not the mixed
workloads we really need to see.

(**) They used to be _awful_ and a lot of the cheaper drives still are.
ZFS tacitly admits this with the l2arc_norw parameter. If all-ssd
systems become common then a wider variation of this switch might be needed.



To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Doug Dumitru
2013-11-04 04:02:29 UTC
Permalink
I would love to see your config and an 'iostat' during your big
operations. I get stuck at 280K 4K random reads with q=200 and it is a
pure fight against latency. If you are getting more, I suspect some IOs
are coming from RAM.

In terms of writing 2.5GB/sec sequentially, I guess if you throw a lot of
hardware at it, it would be possible. My understanding is that ZFS does an
SHA compute on writes.  My system does SHA at 216MB/sec/core (according to
an 'openssl speed' test), which would be 1.2GB/sec for 6 cores.  If you have
16, I suspect you could get 2.5, but not much more.  If ZFS has a faster SHA
algorithm (or has shifted to something else), I want it!

One of my target applications is VDI hosting.  To get the best bang for the
buck, you need lots of 4K IOPS, mostly on writes.  I don't really see a
way to tweak the ashift to hit this cleanly.  I have 24-drive SSD arrays
in the field supporting > 4000 seats from a single SAN node.

With questions like "is mirroring too expensive", it depends on the
alternatives.  If you can get 1M 4K writes allowing for two concurrent
drive failures and still get 90% of the raw drives' capacity plus real-time
de-dupe, then ZFS and mirroring seem pretty costly.

I do think some sort of documentation on write amplification is in order.
If you read about raid-z#, you expect some extra writes, but not a
multiplying effect. It is not hard to see what the underlying drives do
(just grab /proc/diskstats in Linux) and compare this with the IO actually
written. I would not be too surprised if some of the amplification I am
seeing is actually accidental with schedulers firing off early and hurting
HDD performance as well.
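
As a rough sketch of that measurement (the device-name pattern is a
placeholder; sectors written is the 10th field in /proc/diskstats, i.e. the
7th after the device name):

before=$(awk '$3 ~ /^sd[b-y]$/ {s += $10} END {print s}' /proc/diskstats)
# ... run the write test here and note how many bytes it claims to have written ...
after=$(awk '$3 ~ /^sd[b-y]$/ {s += $10} END {print s}' /proc/diskstats)
echo $(( (after - before) * 512 )) bytes actually hit the drives

Dividing that figure by the bytes the benchmark reports gives the
amplification ratio being discussed here.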

Again, thanks for all the help.

Doug Dumitru
EasyCo LLC


On Sunday, November 3, 2013 6:38:54 PM UTC-8, Marcus Sorensen wrote:
>
> I've run all-ssd arrays of up to 20 drives and not gotten results as
> horrible as yours. I don't have enough time to dig into what the difference
> is, but I can say I've written 200GB sequentially at 2.5GB/s (20 10GB files
> simultaneously), then exported/imported the pool to clear arc and read them
> back at 5GB/s. Exporting/importing and then running random reads on the
> files yielded ~600k iops.
>
> I do still think that you need to increase your zvol block size, you
> shouldn't be getting 5:1 writes unless you are fixing the record size to be
> close to your ashift. You want your record size to be number of data disks
> * ashift for raidzX. It may cause you to read a bit more if you always only
> have small random io, but it shouldn't make too much of a difference. Arc
> will help, but not with a benchmark designed to avoid it. Also, direct and
> sync are two different things. As far as I remember zfs Linux doesn't even
> support directio, but you can still issue syncs to ruin your io as much as
> you want.
>
> In the end each filesystem is a tradeoff. I recently saw that checksum
> support was put into the kernel for xfs, which is good but it won't be able
> to correct errors. I think other filesystems might give me better numbers
> in some cases, but I need the features.
> On Nov 3, 2013 6:46 PM, "Doug Dumitru" <dougdumit...-***@public.gmane.org>
> wrote:
>
>> Andrew,
>>
>> My tests without O_DIRECT are only marginally better. My understanding
>> is that O_DIRECT should still buffer inside of ZFS (unless there is a ZIL).
>>
>> My overall take on ZFS is somewhat harsh when considering large pure SSD
>> arrays.
>>
>> * ZFS has a hard time keeping up or even reaching 1GB/sec of random IO or
>> 1.5GB/sec of linear IO.
>> * ZFS has wear amplification ranging from 2:1 for simple, no redundancy,
>> stripe sets, to over 20:1 with triple raid parity. This likely gets worse
>> for full pools.
>>
>> If you think this is in error, then please let me know. I am not saying
>> that ZFS is "bad", it is just designed to address a different set of
>> problems in a different run-time environment.
>>
>> Doug Dumitru
>> EasyCo LLC
>>
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to zfs-discuss...-VKpPRiiRko4/***@public.gmane.org.
>>
>

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Alex Narayan
2013-11-04 04:07:35 UTC
Permalink
If it's the hashing that's taking up cycles and slowing the system down
perhaps a fast cpu at say 4 or more ghz could help?

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Doug Dumitru
2013-11-04 04:13:37 UTC
Permalink
On Sunday, November 3, 2013 8:07:35 PM UTC-8, Alex Narayan wrote:
>
> If it's the hashing that's taking up cycles and slowing the system down
> perhaps a fast cpu at say 4 or more ghz could help?
>
I do suspect that hashing (and other CPU stuff) is a real issue.  My array
is good for 6GB/sec on writes and 10GB/sec+ on reads.  My CPU is fast (6
core, 3.3-3.8GHz) but faster ones are out there (at least getting more cores
is easy).
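
If anyone wants to sanity-check the same thing on their own box, a quick
(and admittedly crude) way to bound the hashing budget:

openssl speed sha256   # rough single-core SHA-256 throughput at several block sizes
nproc                  # threads available, for the cores-times-throughput estimate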

Doug Dumitru
EasyCo LLC




To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Marcus Sorensen
2013-11-04 04:22:07 UTC
Permalink
No, the iops aren't coming from RAM. When they do I get 17GB/s and 1.2M
read iops. I found by and large that the performance was card limited; once
I got beyond 12 disks on a card I wouldn't get more. Cards connected to
expanders were particularly bad (9211-8i or 9207-8i), only providing
20Gbit/s per card, regardless of whether I used one 24Gbit/s port or spread
across both. Switching to direct-connect backplanes and 16-port SATA cards
made it much better.

I'll look at the amplification in detail if I get a moment. I'll look
directly at /sys though, as well as the SMART lifetime write data.
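
For the SMART side, something along these lines usually works, though the
attribute name varies by vendor (241 is often Total_LBAs_Written on consumer
drives), so treat it as a sketch:

for d in /dev/sd[b-y]; do
    echo "== $d =="
    smartctl -A "$d" | grep -Ei 'Total_LBAs_Written|Lifetime_Writes'
done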

I played a lot with zvol block size and didn't see much of a difference at
all on 4k vs 32k writes. Not to say that the performance was optimal, just
that I didn't really see a penalty outside of margin of error when doing 4k
io on a 32k block zvol. Perhaps because it writes more anyway if you use
such a small block size with raidz, as discussed.

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Marcus Sorensen
2013-11-04 04:23:34 UTC
Permalink
On writes I think Fletcher is default, maybe not, but it's faster than SHA
and still very unlikely to collide.

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Chris Siebenmann
2013-11-04 04:32:29 UTC
Permalink
| On writes I think Fletcher is default, maybe not, but its faster than
| SHA and still very unlikely to collide.

I believe that fletcher4 is the default checksum if you are not using
dedup and SHA the default checksum if you are. In a non-dedup situation
collisions are unimportant because the checksum's only purpose is to
detect block damage, so a higher risk of deliberate collisions is not
particularly bad.
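
A quick way to confirm what a given dataset is actually using (the dataset
name here is just a placeholder):

zfs get checksum,dedup tank/vol    # checksum 'on' means fletcher4; enabling dedup moves those blocks to sha256
zfs set checksum=sha256 tank/vol   # only worth forcing if the stronger hash is actually wanted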

- cks

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Marcus Sorensen
2013-11-04 05:03:21 UTC
Permalink
Just a quick test on a ZFS FS, writing 1GB to a 5-disk raidz:

cat /proc/diskstats && head -c 1G /dev/urandom > /data/deleteme && cat
/proc/diskstats

field 7 after the partition/device name yields sectors written, I used
this to extrapolate:

sda2 529011
sdb2 529005
sdc2 528990
sdd2 529022
sde2 529011
total 2645039 sectors, 1354259968 bytes, 1.3GB for 1GB data, sounds
about right for 4:1 data:parity. Next up, zvol...


To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Marcus Sorensen
2013-11-04 05:21:05 UTC
Permalink
I should mention I just stuck with all defaults unless specified.

Here's a zvol with 4k blocksize (note my ashift is 9 on these drives,
so it's not going to penalize me. I'd have to reconfigure this or find
some other test disks to do 4k ashift):

cat /proc/diskstats && dd if=/var/deleteme of=/dev/zvol/data/vol4k
bs=1M && cat /proc/diskstats

539244
539211
539224
539247
539239
2696165 1380436480, again 1.3GB written for 1GB of data

Now let's re-do that but sync with every block:

cat /proc/diskstats && dd if=/var/deleteme of=/dev/zvol/data/vol4k
bs=1M oflag=dsync && cat /proc/diskstats

1137562
1137590
1137594
1137595
1137526
5687867 2912187904

Ah, here we get 2.9GB written to disk for only 1GB data. Still, I
have to ask, does your application really call sync for every 4k
random write, or is that just your benchmark?


To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Doug Dumitru
2013-11-04 05:40:45 UTC
Permalink
I think there are a couple of reasons I am getting different numbers.  Mainly,
I am overwriting data randomly, not just doing a linear fill.  In your last
example, it is a bit surprising that it is this bad.  You are after all doing
a bs=1M, so syncing should not hurt that much.  Try it with oflag=direct
and you will get the logic my benchmark does.
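
In other words, the oflag=direct variant of the dsync command above, run
against the same zvol and paths from that test:

cat /proc/diskstats && dd if=/var/deleteme of=/dev/zvol/data/vol4k \
    bs=1M oflag=direct && cat /proc/diskstats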

With bs=1M, the bio's hitting the block layer should have 256 4K pages
which should be enough at one shot to keep raidz pretty happy. So the
2.9:1 write amplification is again a mystery.

In terms of "do I have to do a sync", I am running oflag=direct which is
similar to driving bio requests from kernel space. No copies, no block
device buffers.

One other "gotcha" to check is run a cat /proc/meminfo after these commands
to make sure you don't have a huge "Dirty" memory set that will get written
sometime later.

Doug


To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Marcus Sorensen
2013-11-04 05:42:51 UTC
Permalink
But direct isn't the same as sync...


To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Gregor Kopka
2013-11-04 08:47:52 UTC
Permalink
Regarding O_DIRECT: https://github.com/zfsonlinux/zfs/issues/224

Gregor



To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Marcus Sorensen
2013-11-04 06:30:39 UTC
Permalink
I think it's because I have variable block size (recordsize) on the
zfs filesystem, as opposed to static with the zvols. I tried 4k random
on this filesystem instead of 1M sequential, and it wrote 8.5GB for
1G. So I write 4k, but it may be a part of a 128k stripe or something,
and I may be rewriting that multiple times. I'd have to zdb the thing
to verify the transaction sizes, though, and that's time consuming.

If I understand, you're using this for VMs. If I start a qemu instance
and point it to a zvol with cache=none, it will use o_direct (which
seems to work better with zvols than with the zfs filesystem itself,
but I really have no idea if that actually works with qemu), but that
doesn't mean that all writes from the VM are sync. The VM OS itself
is write caching and flushing in larger chunks, and even when it's not
write caching, direct io at the hypervisor level doesn't guarantee the
write is flushed like sync does, it just avoids caching at the host
level. So unlike sync, I don't think direct io forces the transaction
group to flush. All said, you know how your VDI actually behaves.
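
As a hedged aside, if the guest images live on zvols then the zvol block
size and the qemu cache mode are the two knobs in play; the names and sizes
below are purely illustrative:

zfs create -V 40G -o volblocksize=32K tank/vdi01    # larger volblocksize = fewer, fuller raidz stripes
qemu-system-x86_64 -m 2048 \
    -drive file=/dev/zvol/tank/vdi01,if=virtio,format=raw,cache=none
# cache=none opens the zvol O_DIRECT at the host; cache=directsync would add
# O_DSYNC and force the sync behaviour being discussed above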


To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Marcus Sorensen
2013-11-04 06:12:44 UTC
Permalink
Here's random 1M write, iodepth of 8:
fio --filename=/dev/zvol/data/vol4k --ioengine=sync --iodepth=8
--rw=randwrite --bs=1m --direct=1 --size=1G --numjobs=1 --name=stuff

467659
467590
467696
467874
467806
2338625 1197376000, ~1.2GB for 1G

And random 4k:
fio --filename=/dev/zvol/data/vol4k --ioengine=sync --iodepth=8
--rw=randwrite --bs=4k --direct=1 --size=1G --numjobs=1 --name=stuff

777457
777291
777259
777417
777618
3887042 1990165504, ~2GB for 1G

8 parallel threads doing random 4k

cat /proc/diskstats && fio --filename=/dev/zvol/data/vol4k
--ioengine=sync --iodepth=8 --rw=randwrite --bs=4k --direct=1
--size=1G --thread --numjobs=8 --name=stuff && cat /proc/diskstats

6119350
6119160
6119567
6120793
6121466
30600336 15667372032, ~15.6GB for 8G written

So at least with 512b ashift and 4k blocksize on a 5 disk raidz, I'm
only seeing ~2x amplification worst case


To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Chris Siebenmann
2013-11-04 15:33:18 UTC
Permalink
| Now let's re-do that but sync with every block:
|
| cat /proc/diskstats && dd if=/var/deleteme of=/dev/zvol/data/vol4k
| bs=1M oflag=dsync && cat /proc/diskstats
|
| 1137562
| 1137590
| 1137594
| 1137595
| 1137526
| 5687867 2912187904
|
| Ah, here we get 2.9GB written to disk for only 1GB data. Still, I
| have to ask, does your application really call sync for every 4k
| random write, or is that just your benchmark?

Because of how the ZIL works, frequent sync()s/fsync()s with large
amounts of data are a worst case for write amplification; you'll write
the data once to the ZIL and then again to the regular ZFS tree.

My understanding is that fsync() never forces a TXG commit but instead
always flushes pending data to the ZIL. (There is always a ZIL even if
you don't have a separate SLOG device.)
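
One hedged mitigation for exactly that pattern (dataset name is a
placeholder): logbias=throughput tells ZFS to write large synchronous blocks
straight into the main pool and only log a pointer to them, so the payload is
not written twice, at the cost of the latency benefit a SLOG gives:

zfs set logbias=throughput tank/vol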

- cks

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Andrew Galloway
2013-11-04 04:09:26 UTC
Permalink
It is nearly always going to "buffer". The question is whether it is also
going to engage ZIL mechanics, and if so, where those writes are going to go.
You have to look at your average block size, your COMSTAR LU's writeback
cache setting, the manner of data you're writing (sync or not), the sync and
logbias settings on the pool, and possibly also the cache flush tuneable.
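
On the Linux port those last few knobs are easy to inspect; a sketch, with
the dataset name as a placeholder and the module path assuming ZFS on Linux:

zfs get sync,logbias,volblocksize tank/vol          # per-dataset write behaviour
cat /sys/module/zfs/parameters/zfs_nocacheflush     # the cache-flush tuneable (non-zero = flushes suppressed)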

Your 1-1.5 GB/s numbers on SSD are in line with my own experiences for
completely untuned SSD pools doing small block I/O with no consideration
made for the ZIL. I've seen considerably higher numbers, and markedly higher
stable/repeatable numbers. There's a lot to touch on to get there, though.
Far more than I'm willing to tap out on this phone.

I've never actually heard anyone comment on wear amplification before, that
is somewhat interesting, but while there are likely some efficiencies to be
had with tuning, proper environment setup, and possibly in code, I'd wager
you're always going to have more than, say, ext4. That's just the nature of
the beast.

ZFS is not bad. However, what it also is not is designed for utmost
performance. If nobody told you that before, that's unfortunate.

That it is as performant as it is is awesome. Everything I have seen and
heard is that it was primarily designed for RELIABILITY, not speed. In
fact, I'd argue performance was a distant third on the agenda behind data
integrity and ease of administration/use. The ARC and the fact that most
use-cases are read-mostly on small subsets of total data can mislead people
to believe ZFS is the fastest filesystem in the west. It is, but only in
those scenarios.

If you're mostly-write or have a huge uncacheable dataset, ZFS is
definitely not fastest. It is one of the safest, still, but that safety
comes at the expense of speed. The amount of effort ZFS puts into writing
your data compared to a lot of other filesystems completely precludes it
from winning any speed wars, and for myself, that's how I like it.

- Andrew

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+unsubscribe-VKpPRiiRko7s4Z89Ie/***@public.gmane.org
Doug Dumitru
2013-11-04 04:19:34 UTC
Permalink
Andrew,

Thank you for a balanced and realistic answer.  I am actually quite
impressed that ZFS gets away with the feature set that it has and still
maintains any performance.  Realize that before my first post, I did the
typical "google search" to find hints at what I am looking for, and what is
"on line" is very difficult to parse for any real numbers.  And your
response is way too long for tapping on a phone.  Wait until you get to a
real keyboard ;)

Doug Dumitru
EasyCo LLC


On Sunday, November 3, 2013 8:09:26 PM UTC-8, Andrew Galloway wrote:
>
> It is nearly always going to "buffer". The question is, is it going to
> also engage ZIL mechanics, and if so, where are those writes going to go.
> You have to look at your average block size, your COMSTAR LU's writeback
> cache setting, the manner of data you're writing (sync or not), the sync
> and logbias settings on the pool, and possibly also the cache flush
> tuneable.
>
> Your 1-1.5 GB/s numbers on SSD are in line with my own experiences for
> completely untuned ssd pools doing small block I/O with no consideration
> made for zil. I've seen considerably higher numbers, and markedly higher
> stable/repeatable numbers. There's a lot to touch on to get there, though.
> Far more than I'm willing to tap into this phone.
>
> I've never actually heard anyone comment on wear amplification before,
> that is somewhat interesting, but while there's likely some efficiencies to
> be had with tuning, proper environment setup, and possibly in code, I'd
> wager you're always going to have more than, say, ext4. That's just the
> nature of the beast.
>
> ZFS is not bad. However, what it also is not is designed for utmost
> performance. If nobody told you that before, that's unfortunate.
>
> That it is as performant as it is is awesome. Everything I have seen and
> heard is that it was primarily designed for RELIABILITY, not speed. In
> fact, I'd argue performance was a distant third on the agenda behind data
> integrity and ease of administration/use. The ARC and the fact that most
> use-cases are read-mostly on small subsets of total data can mislead people
> to believe ZFS is the fastest filesystem in the west. It is, but only in
> those scenarios.
>
> If you're mostly-write or have a huge uncacheable dataset, ZFS is
> definitely not fastest. It is one of the safest, still, but that safety
> comes at the expense of speed. The amount of effort ZFS puts into writing
> your data compared to a lot of other filesystems completely precludes it
> from winning any speed wars, and for myself, that's how I like it.
>
> - Andrew
> On Nov 3, 2013 5:46 PM, "Doug Dumitru" <dougdumit...-***@public.gmane.org<javascript:>>
> wrote:
>
>> Andrew,
>>
>> My tests without O_DIRECT are only marginally better. My understanding
>> is that O_DIRECT should still buffer inside of ZFS (unless there is a ZIL).
>>
>> My overall take on ZFS is somewhat harsh when considering large pure SSD
>> arrays.
>>
>> * ZFS has a hard time keeping up or even reaching 1GB/sec of random IO or
>> 1.5GB/sec of linear IO.
>> * ZFS has wear amplification ranging from 2:1 for simple, no redundancy,
>> stripe sets, to over 20:1 with triple raid parity. This likely gets worse
>> for full pools.
>>
>> If you think this is in error, then please let me know. I am not saying
>> that ZFS is "bad", it is just designed to address a different set of
>> problems in a different run-time environment.
>>
>> Doug Dumitru
>> EasyCo LLC
>>
>>
>

Gordan Bobic
2013-11-01 19:23:49 UTC
Permalink
Perhaps, but that doesn't port to ZFS. For ZFS the stripe is always a power of 2 and 2^ashift <= stripe <= 128KB. If your typical write size is 4KB, anything but mirroring almost certainly doesn't make sense.

If you want your app to always write an "optimal" amount in ZFS, make sure your vdevs all have 2^n + redundancy disks and that your app always writes 128KB (or a multiple thereof).
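
A minimal sketch of that sort of layout (pool and device names are purely illustrative, and ashift=12 is assumed as elsewhere in this thread): 4 data + 1 parity per raidz1 vdev keeps the data disks at a power of two, and a 128KB volblocksize on the zvol keeps full-record writes stripe-aligned:

  # 4+1 raidz1 vdevs: 2^2 data disks plus redundancy per vdev
  zpool create -f -o ashift=12 tank \
      raidz1 sdb sdc sdd sde sdf \
      raidz1 sdg sdh sdi sdj sdk
  # match the zvol block size to the 128KB application write size
  zfs create -V 300G -o volblocksize=128K tank/test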



Doug Dumitru <dougdumitruredirect-***@public.gmane.org> wrote:

>
>
>On Friday, November 1, 2013 11:22:11 AM UTC-7, Gordan Bobic wrote:
>>
>> Err... 23-disk RAID5 is way beyond silly, both in terms of performance and
>> reliability.
>
>
>I understand the argument in terms of reliability. Then again, with SSDs,
>the rebuild time is quite fast, so the multi-disk error window is lower.
>
>In terms of performance it depends on the data patterns going to the
>array. My applications all write perfect raid stripes and perfect long
>blocks that match flash erase block boundaries. Thus write performance is
>"wickedly good".
>
>
>Doug Dumitru
>EasyCo LLC
>
>

Gordan Bobic
2013-11-01 20:21:20 UTC
Permalink
Beginning to suspect? You do realize that the rest of us here aren't guessing, right?


Doug Dumitru <dougdumitruredirect-***@public.gmane.org> wrote:

>
>
>On Friday, November 1, 2013 12:27:34 PM UTC-7, Christ Schlacta wrote:
>>
>> 22*4k means your application always writes 88k at a time atomically?
>> Wait.. flash erase is 8k. 22*8k is 176k. Even so, zfs will break it up
>> into 128k, and 64k padded, because of internal limitations of almost all
>> filesystems.
>>
>> If you write 4k blocks on an ashift=12 pool, each write will be one data
>> block plus parity, plus metadata. With raidz2 that's at least
>> 1 data + 2 parity + 2 meta = 5 before walking the merkle tree.
>>
>> You are seriously making a huge mistake writing 4k blocks to a raidz2 pool
>> with 22+2 disks.
>>
>> Either write larger blocks to smaller vdevs, or accept massive
>> amplification.
>>
>> 4+2= 16k or 32k.
>> 2+1= 8k or 16k.
>> 8+2= 32k or 64k.
>>
>> The only way 4k writes will ever be the right option for efficiency of
>> space and writes is on mirrored pairs.
>>
>I am beginning to suspect that this is the case. I re-ran my tests as six
>4-drive raid-z1 sets:
>
>+ modprobe zfs zfs_arc_min=1073741824 zfs_arc_max=8589934592
>+ zpool create data raidz1 /dev/sdb /dev/sdc /dev/sdd /dev/sde raidz1
>/dev/sdf /dev/sdg /dev/sdh /dev/sdi raidz1 /dev/sdj /dev/sdk /dev/sdl
>/dev/sdm raidz1 /dev/sdn /dev/sdo /dev/sdp /dev/sdq raidz1 /dev/sdr
>/dev/sds /dev/sdt /dev/sdu raidz1 /dev/sdv /dev/sdw /dev/sdx /dev/sdy -f -o
>ashift=12
>+ zfs create -V 300G data/test
>+ zfs set logbias=throughput data/test
>
>Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz
>avgqu-sz await r_await w_await svctm %util
>sda 0.00 0.00 0.00 0.00 0.00 0.00
>0.00 0.00 0.00 0.00 0.00 0.00 0.00
>dm-0 0.00 0.00 0.00 0.00 0.00 0.00
>0.00 0.00 0.00 0.00 0.00 0.00 0.00
>dm-1 0.00 0.00 0.00 0.00 0.00 0.00
>0.00 0.00 0.00 0.00 0.00 0.00 0.00
>dm-2 0.00 0.00 0.00 0.00 0.00 0.00
>0.00 0.00 0.00 0.00 0.00 0.00 0.00
>dm-3 0.00 0.00 0.00 0.00 0.00 0.00
>0.00 0.00 0.00 0.00 0.00 0.00 0.00
>dm-4 0.00 0.00 0.00 0.00 0.00 0.00
>0.00 0.00 0.00 0.00 0.00 0.00 0.00
>sdc 0.00 0.00 727.07 475.00 2.84 2.67
>9.38 0.21 0.17 0.17 0.17 0.11 12.87
>sdb 0.00 0.00 2027.73 3524.83 7.92 24.63
>12.01 0.95 0.17 0.19 0.16 0.07 39.67
>sdd 0.00 0.00 2097.47 3502.10 8.19 24.63
>12.01 0.96 0.17 0.20 0.16 0.07 40.45
>sde 0.00 0.00 656.73 3119.67 2.57 22.22
>13.44 0.60 0.16 0.18 0.16 0.06 21.56
>sdf 0.00 0.00 2033.37 3909.43 7.94 24.22
>11.08 1.04 0.18 0.20 0.17 0.07 40.85
>sdh 0.00 0.00 2078.67 3912.10 8.12 24.21
>11.05 1.05 0.18 0.20 0.17 0.07 41.43
>sdg 0.00 0.00 714.50 1847.47 2.79 12.58
>12.29 0.42 0.16 0.18 0.16 0.07 18.03
>sdi 0.00 0.00 668.10 2116.60 2.61 11.87
>10.65 0.48 0.17 0.19 0.17 0.07 18.28
>sdp 0.00 0.00 2186.20 3769.73 8.54 24.19
>11.25 1.09 0.18 0.26 0.14 0.07 43.72
>sdq 0.00 0.00 578.50 1774.63 2.26 10.90
>11.46 0.34 0.14 0.20 0.13 0.06 15.08
>sdj 0.00 0.00 2109.27 3700.73 8.24 24.62
>11.58 1.16 0.20 0.29 0.15 0.08 45.00
>sdk 0.00 0.00 662.57 2697.27 2.59 16.40
>11.57 0.52 0.16 0.22 0.14 0.06 19.48
>sdl 0.00 0.00 2034.93 3705.43 7.95 24.62
>11.62 1.16 0.20 0.30 0.15 0.08 44.92
>sdm 0.00 0.00 737.17 1306.17 2.88 8.45
>11.36 0.33 0.16 0.18 0.15 0.08 15.36
>sdn 0.00 0.00 1971.43 3799.23 7.70 24.19
>11.32 1.03 0.18 0.26 0.13 0.07 42.17
>sdo 0.00 0.00 793.47 2177.10 3.10 13.52
>11.46 0.45 0.15 0.18 0.14 0.06 18.63
>sdr 0.00 0.00 1940.23 3429.63 7.58 24.75
>12.33 0.95 0.18 0.19 0.17 0.07 38.72
>sds 0.00 0.00 824.23 1510.57 3.22 10.76
>12.26 0.39 0.17 0.18 0.16 0.08 18.28
>sdu 0.00 0.00 553.53 1981.27 2.16 14.23
>13.25 0.43 0.17 0.17 0.17 0.06 16.25
>sdt 0.00 0.00 2210.87 3421.73 8.64 24.76
>12.14 1.01 0.18 0.20 0.17 0.07 41.04
>sdy 0.00 0.00 801.83 2338.87 3.13 15.23
>11.97 0.55 0.17 0.19 0.17 0.07 21.67
>sdx 0.00 0.00 1967.53 3915.73 7.69 24.52
>11.21 1.05 0.18 0.21 0.17 0.07 40.96
>sdw 0.00 0.00 579.40 1651.80 2.26 9.53
>10.82 0.38 0.17 0.19 0.17 0.07 15.27
>sdv 0.00 0.00 2190.80 3921.77 8.56 24.51
>11.08 1.10 0.18 0.20 0.17 0.07 43.19
>zd0 0.00 0.00 0.00 11710.23 0.00 45.74
>8.00 9.91 0.85 0.00 0.85 0.09 100.00
>
>The performance and wear are a little worse than a single big raid-z1 set.
>
>I will run a mirrored set and post the results in a few minutes. Mirroring
>on SSDs is problematic in that you are trying to optimize $/GB.
>
>Doug Dumitru
>EasyCo LLC
>
>> On Nov 1, 2013 12:11 PM, "Doug Dumitru" <dougdumit...-***@public.gmane.org<javascript:>>
>> wrote:
>>
>>>
>>>
>>> On Friday, November 1, 2013 11:22:11 AM UTC-7, Gordan Bobic wrote:
>>>>
>>>> Err... 23-disk RAID5 is way beyond silly, both in terms of performance
>>>> and reliability.
>>>
>>>
>>> I understand the argument in terms of reliability. Then again, with
>>> SSDs, the rebuild time is quite fast, so the multi-disk error window is
>>> lower.
>>>
>>> In terms of performance it depends on the data patterns going to the
>>> array. My applications all write perfect raid stripes and perfect long
>>> blocks that match flash erase block boundaries. Thus write performance is
>>> "wickedly good".
>>>
>>>
>>> Doug Dumitru
>>> EasyCo LLC
>>>
>>>
>>>
>>
>

Gordan Bobic
2013-11-04 18:05:55 UTC
Permalink
You might be better off with a HyperOS SATA RAM drive. Just add 32GB of DDR2 DIMMs.



Uncle Stoatwarbler <stoatwblr-***@public.gmane.org> wrote:

>On 04/11/13 15:43, Chris Siebenmann wrote:
>
>
>> I think that people may have a mis-perception of the ZIL and how it
>> works. ZFS pools *always* have a ZIL; the only question is whether it is
>> internal (written to regular data vdevs) or external on a separate log
>> device.
>
>I'm aware of that.
>
>For those not following this: if there is no ZIL device, ZIL data is
>striped across the main vdev(s), then written to the main filesystem and
>finally the ZIL stripe is removed, ensuring atomic data integrity but
>resulting in at least a 2:1 write amplification.
>
>If you want to compare apples to apples on an ext4 filesystem, then it
>must be mounted using "data=journal", otherwise the ext4 system only
>journals metadata (data=ordered) and is susceptible to data loss in the
>event of a power failure. The ext4 FS should also be mounted with the
>journal_checksum parameter (which does what it says - creates and writes
>journal checksums for added corruption resistance).
>
>
>> As far as I know you'll get the same write amplification almost
>> regardless of where the ZIL lives, it's just a question of which device
>> sees it.
>
>Agreed - no matter where the ZIL is located, you will see at _least_ 2:1
>write amplification, but if you cripple ZFS layout it can end up a lot
>worse.
>
>
>The reasons for preferring a small, fast, possibly SLC, dedicated ZIL
>device (or mirror if you're paranoid) are:
>
>1: It is a lot easier to replace the ZIL than a bunch of larger drives
>if it wears out (it can even be done on a running system!)
>
>2: because of way the dedicated ZIL works, you're sending it sequential
>writes, not random ones (longer life)
>
>3: Assuming the device is not ridiculously small, it will coalesce most
>of the 4kb writes into something meaningful (and mostly sequential) for
>the main vdev, saving write-amplification issues there as well as
>avoiding the IOPS degradation that a direct random RW scenario causes.
>
>4: If zfs sync=always, as long as the ZIL device sequential write speed
>is adequate it will keep up with writes as they happen. No need for
>O_DIRECT games.
>
>and
>
>5: It allows you to use cheaper, slower SSDs for the main vdev(s) than
>if you try to achieve the same IO loading with direct writes/reads. For
>this scenario the slowish vdev SSDs can be bracketed by high-performance
>cache/zil devices (assuming CPU/RAM/Network can keep up too)
>
>
>As always: YMMV.
>
>
>It's worth noting that every single vendor pitching ZFS at my dayjob is
>specifying 8GB STEC ZeusRAM drives for ZIL. These things are
>hellaciously expensive but they're pretty much guaranteed to keep up.
>
>

Niels de Carpentier
2013-11-04 20:17:59 UTC
Permalink
> You might be better off with a HyperOS SATA RAM drive. Just add 32GB of
> DDR2 DIMMs.

That's probably too big and too slow.

I would use a caching raid controller, with BBU and the cache configured to
only cache writes. As long as the SLOG is not configured larger than the
controller cache (and the firmware is smart), the drive speed shouldn't make
a difference.

This would be a cheap solution to have a very fast ZIL.
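
A rough sketch of what that looks like on the ZFS side (the /dev/sdz name is only illustrative; the assumption is that the controller exports its BBU-cached single-drive volume as one block device, with a small partition on it kept below the cache size):

  # add the controller-backed partition as a dedicated log (SLOG) vdev
  zpool add data log /dev/sdz1
  zpool status data    # the new "logs" section should now list sdz1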

Niels


Gordan Bobic
2013-11-04 20:20:48 UTC
Permalink
On 11/04/2013 08:17 PM, Niels de Carpentier wrote:
>> You might be better off with a HyperOS SATA RAM drive. Just add 32GB of
>> DDR2 DIMMs.
>
> That's probably too big and too slow.
>
> I would use a caching raid controller, with BBU and cache configured to
>> only cache writes. As long as the SLOG is not configured larger than the
> controller cache (and smart firmware), the drive speed shouldn't make a
> difference.
>
> This would be a cheap solution to have a very fast ZIL.

I do believe mentioning caching RAID controllers is considered extreme
profanity on this list.

Gordan


Andrew Galloway
2013-11-04 20:25:47 UTC
Permalink
Horrible profanity.

Also a horrible idea. I don't even see how it helps. The RAID card
cannot tell the difference between a ZIL write and a normal data write, so
it's going to have to write both back anyway. Having it cache the write is
only useful if the back-end device can keep up, still. It's not like it can
cache the ZIL write and then discard it once the real write occurs, because
again, it doesn't know these ZFS semantics and can't tell what writes are
'real' and what are 'log'.

- Andrew


On Mon, Nov 4, 2013 at 12:20 PM, Gordan Bobic <gordan.bobic-***@public.gmane.org>wrote:

> On 11/04/2013 08:17 PM, Niels de Carpentier wrote:
>
>> You might be better off with a HyperOS SATA RAM drive. Just add 32GB of
>>> DDR2 DIMMs.
>>>
>>
>> That's probably too big and too slow.
>>
>> I would use a caching raid controller, with BBU and cache configured to
>> only cache writes. As long as the SLOG is not configured larger than the
>> controller cache (and smart firmware), the drive speed shouldn't make a
>> difference.
>>
>> This would be a cheap solution to have a very fast ZIL.
>>
>
> I do believe mentioning caching RAID controllers is considered extreme
> profanity on this list.
>
> Gordan
>
>
>
>

Niels de Carpentier
2013-11-04 20:42:30 UTC
Permalink
> Horrible profanity.
>
> Also a horrible idea. I don't even know how it even helps. The RAID card
> cannot tell the difference between a ZIL write and a normal data write, so
> it's going to have to write both back anyway. Having it cache the write is
> only useful if the back-end device can keep up, still. It's not like it
> can
> cache the ZIL write and then discard it once the real write occurs,
> because
> again, it doesn't know these ZFS semantics and can't tell what writes are
> 'real' and what are 'log'.

Well, the raid controller would only be used for the ZIL, and would only
need 1 connected drive (configured as a single drive stripe). Basically a
cheap disk backed RAM drive, with properties that are perfect for use as a
SLOG. I know generally you shouldn't use raid controllers, but I believe
it does make sense if used in this way.

You could also use some sort of RAM drive, but SATA ones are limited by
the SATA bus, and pci-e ones are extremely expensive.

I don't think there is a cheaper solution with the same performance. If
there is, I would love to hear it.

Niels



Andrew Galloway
2013-11-04 20:50:17 UTC
Permalink
I question whether that would work. Write caching is all well and good, but
it has to rely on the backing storage for its speed to some degree. It either
has to start limiting the ingress or flat-out rejecting it if the back-end
device can't keep up with the input. Thus the back-end device ultimately
determines the potential speed of this setup. The logic the
RAID card employs in that cache is also of importance -- ZFS, for instance,
expects the slog to not be reordered or put to disk out of the order in
which it was received; that's precisely why it sends along a CACHE SYNC
after every write. And if the RAID card isn't 'massaging' the data, then
all it could be doing is basically holding it and hoping there are peaks &
valleys in the incoming stream to let the back-end disk catch up in the
lull periods. If so, sure, you might see a minor improvement in overall
performance from this, but I don't think it would ever be worth the effort.
You'd probably be better off just using the RAM drive in the first place,
and letting ZFS deal with it when it is too slow/small to keep up with
ingress. Especially as we move forward and the smarter write code from
Delphix makes it into mainstream.

- Andrew


On Mon, Nov 4, 2013 at 12:42 PM, Niels de Carpentier
<zfs-zx3GLP/***@public.gmane.org>wrote:

> > Horrible profanity.
> >
> > Also a horrible idea. I don't even know how it even helps. The RAID card
> > cannot tell the difference between a ZIL write and a normal data write,
> so
> > it's going to have to write both back anyway. Having it cache the write
> is
> > only useful if the back-end device can keep up, still. It's not like it
> > can
> > cache the ZIL write and then discard it once the real write occurs,
> > because
> > again, it doesn't know these ZFS semantics and can't tell what writes are
> > 'real' and what are 'log'.
>
> Well, the raid controller would only be used for the ZIL, and would only
> need 1 connected drive (configured as a single drive stripe). Basically a
> cheap disk backed RAM drive, with properties that are perfect for use as a
> SLOG. I know generally you shouldn't use raid controllers, but I believe
> it does make sense if used in this way.
>
> You could also use some sort of RAM drive, but SATA ones are limited by
> the SATA bus, and pci-e ones are extremely expensive.
>
> I don't think there is a cheaper solution with the same performance. If
> there is, I would love to hear it.
>
> Niels
>
>
>
>

Niels de Carpentier
2013-11-04 21:11:41 UTC
Permalink
> I question that that would work. Write caching is all well and good, but
> it
> to some degree has to rely on the backing storage for its speed. It either
> has to start limiting the ingress or flat out rejecting it if the back-end
> device ultimately can't keep up with the input. Thus, ultimately, the
> back-end device determines the potential speed of this setup. The logic
> the
> RAID card employs in that cache is also of importance -- ZFS, for
> instance,
> expects the slog to not be reordered or put to disk out of the order in
> which it was received; that's precisely why it sends along a CACHE SYNC
> after every write. And if the RAID card isn't 'massaging' the data, then
> all it could be doing is basically holding it and hoping there are peaks &
> valleys in the incoming stream to let the back-end disk catch up in the
> lull periods. If so, sure, you might see a minor improvement in overall
> performance from this, but I don't think it would ever be worth the
> effort.

A battery-backed cache doesn't need to (and shouldn't, for performance
reasons) honor CACHE SYNC. So if the firmware is smart, the drive speed
shouldn't matter, and cache entries can keep being overwritten without
ever being sent to the disk. (This is why I mentioned smart firmware.) You
would of course need to make sure the SLOG size always fits in the cache,
but a SLOG device generally can be pretty small.
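
On the ZFS side, the matching knob is the cache flush tuneable mentioned earlier in the thread; with a genuinely non-volatile cache some setups turn the explicit flushes off. A rough sketch for ZFS on Linux -- note this is global, it affects every vdev, not just the log, and is risky unless the cache really is protected:

  # stop ZFS from issuing cache flush commands to the devices
  echo 1 > /sys/module/zfs/parameters/zfs_nocacheflush
  # persistent form:
  echo "options zfs zfs_nocacheflush=1" >> /etc/modprobe.d/zfs.conf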

> You'd probably be better off just using the RAM drive in the first place,
> and letting ZFS deal with it when it is too slow/small to keep up with
> ingress. Especially as we move forward and the smarter write code from
> Delphix makes it into mainstream.

Well, a proper pci-e RAM drive would be the standard and preferred
solution, but I don't think there are any affordable fast RAM drives.

SATA is too slow for use as a SLOG for an SSD array, and pci-e ones are
very expensive. It is likely cheaper to just use an array of striped SSDs
as an SLOG. (Is it possible to stripe more than 2 SLOG devices?)

Niels



Andrew Galloway
2013-11-04 21:16:50 UTC
Permalink
Yes, you can stripe more than 2 slog devices. Note that slog is round-robin
-- and single queue depth. It will not utilize more and more devices unless
it has 'time' to. If the device(s) utilized for log mechanics are high
enough latency, the addition of more of them will have literally zero
impact on ZIL performance, merely increase its total capacity.
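
For reference, a sketch of what striping log devices looks like (device and pool names illustrative; each bare device listed after "log" becomes its own log vdev, which ZIL allocations then round-robin across):

  zpool add data log sdy sdz
  # contrast with a mirrored log, which is a single vdev:
  #   zpool add data log mirror sdy sdz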


On Mon, Nov 4, 2013 at 1:11 PM, Niels de Carpentier <zfs-zx3GLP/***@public.gmane.org>wrote:

> > I question that that would work. Write caching is all well and good, but
> > it
> > to some degree has to rely on the backing storage for its speed. It
> either
> > has to start limiting the ingress or flat out rejecting it if the
> back-end
> > device ultimately can't keep up with the input. Thus, ultimately, the
> > back-end device determines the potential speed of this setup. The logic
> > the
> > RAID card employs in that cache is also of importance -- ZFS, for
> > instance,
> > expects the slog to not be reordered or put to disk out of the order in
> > which it was received; that's precisely why it sends along a CACHE SYNC
> > after every write. And if the RAID card isn't 'massaging' the data, then
> > all it could be doing is basically holding it and hoping there are peaks
> &
> > valleys in the incoming stream to let the back-end disk catch up in the
> > lull periods. If so, sure, you might see a minor improvement in overall
> > performance from this, but I don't think it would ever be worth the
> > effort.
>
> A battery backed cache doesn't need to (and shouldn't for performance
> reasons) honor CACHE SYNC. So if the firmware is smart, the drive speed
> shouldn't matter, and cache entries can keep being overwritten without
> ever being sent to the disk. (This is why I mentioned smart firmware.) You
> would of course need to make sure the SLOG size always fits in the cache,
> but a SLOG device generally can be pretty small.
>
> > You'd probably be better off just using the RAM drive in the first place,
> > and letting ZFS deal with it when it is too slow/small to keep up with
> > ingress. Especially as we move forward and the smarter write code from
> > Delphix makes it into mainstream.
>
> Well, a proper pci-e RAM drive would be the standard and preferred
> solution, but I don't think there are any affordable fast RAM drives.
>
> SATA is too slow for use as a SLOG for an SSD array, and pci-e ones are
> very expensive. It is likely cheaper to just use an array of striped SSDs
> as an SLOG. (Is it possible to stripe more than 2 SLOG devices?)
>
> Niels
>
>
>
>

Uncle Stoatwarbler
2013-11-04 21:47:40 UTC
Permalink
On 04/11/13 21:11, Niels de Carpentier wrote:

> Well, a proper pci-e RAM drive would be the standard and preferred
> solution, but I don't think there are any affordable fast RAM drives.

So far I haven't found _any_ PCIe RAM drives supported by Linux(*), let
alone affordable ones.

(*) I did find a single device but it's BSD/Illumos only.

> SATA is too slow for use as a SLOG for an SSD array, and pci-e ones are
> very expensive. It is likely cheaper to just use an array of striped SSDs
> as an SLOG. (Is it possible to stripe more than 2 SLOG devices?)

SATA Express is being (slowly) rolled out, but IMHO latency is more
important for SLOG devices in the environment being tested.

Striping may not help much.





Uncle Stoatwarbler
2013-11-04 20:32:46 UTC
Permalink
On 04/11/13 18:05, Gordan Bobic wrote:
> You might be better off with a HyperOS SATA RAM drive. Just add 32GB of DDR2 DIMMs.

SATA-anything isn't fast enough if it's bracketing an SSD array, which is
one of the reasons I'm not taking any of the solutions pitched.

More to the point, vendors sell what they call "certified" solutions and
they won't push anything outside of their little "certification" bubble.

What scares me is that virtually all the vendors are pushing solutions
they don't really understand - and that includes discussions with the EU
technical folk for Nexsan and Infortrend amongst others. The single
biggest offence is "not enough ram" when pitching dedupe, followed by
trying to put hardware raid controllers between the appliance and the
drives.

There are a lot of badly laid out ZFS appliances hitting the market and
they're going to cause a lot of bad PR over the next couple of years.

I'm in a position where I've been able to stave off the purchase of
500TB of storage for 8-12 months in the hope that things will sort
themselves out. Others will not be so lucky.


Schlacta, Christ
2013-11-04 20:37:45 UTC
Permalink
You'll be best off if you build your own appliance around a single hardware
vendor and their HCL. I'd pick the motherboard vendor, but that's up to you :)
On Nov 4, 2013 12:32 PM, "Uncle Stoatwarbler" <stoatwblr-***@public.gmane.org> wrote:

> On 04/11/13 18:05, Gordan Bobic wrote:
>
>> You might be better off with a HyperOS SATA RAM drive. Just add 32GB of
>> DDR2 DIMMs.
>>
>
> SATA-anything isn't fast enough if it's bracketing a SSD array, which is
> one of the reasons I'm not taking any of the solutions pitched.
>
> More to the point, vendors sell what they call "certified" solutions and
> they won't push anything outside of their little "certification" bubble.
>
> What scares me is that virtually all the vendors are pushing solutions
> they don't really understand - and that includes discussions with the EU
> technical folk for Nexsan and Infortrend amongst others. The single biggest
> offence is "not enough ram" when pitching dedupe, followed by trying to put
> hardware raid controllers between the appliance and the drives.
>
> There are a lot of badly laid out ZFS appliances hitting the market and
> they're going to cause a lot of bad PR over the next couple of years.
>
> I'm in a position where I've been able to stave off the purchase of 500Tb
> of storage for 8-12 months in the hope that things will sort themselves
> out. Others will not be so lucky.
>
>
>
