Please help me find why my ZFS is so much slower than md raid.

Discussion:

a***@public.gmane.org

12 years ago

I have a new storage server with 16 disks. I built native ZFS on this
CentOS 6 machine and I get consistently slower benchmarks with ZFS than with md raid.
Using aligned partitions and ashift=12 doesn't help.

These are the numbers:

- md raid10 defaults: 1100 iops, 1200 MB/s read, 460 MB/s write
- md raid6 defaults: 850 iops, 1900 MB/s read, 380 MB/s write
- ZFS striped mirror: 550 iops, 750 MB/s read, 440 MB/s write
- ZFS raidz2, raidz3: 280 iops, 800 MB/s read, 400 MB/s write

So basically, my iops drop to single-disk level and read performance is 1/3 that
of md raid. ZFS is not using dedup or compression.

This is the machine:

- Supermicro X9SCL/X9SCM
- Intel i3-2120
- 4 GB ECC (will be 32 GB eventually with SSD cache, if ZFS checks out)
- 16x 2000 GB ST2000DM001 Advanced format disks (smartctl reports 4096
bytes physical sectors)
- HBA: LSI Logic / Symbios Logic SAS2116
- no multiplier
- ZFS: built from src.rpm:
zfs-0.6.0-rc14.el6.x86_64,
zfs-devel-0.6.0-rc14.el6.x86_64,
zfs-dracut-0.6.0-rc14.el6.x86_64,
zfs-test-0.6.0-rc14.el6.x86_64,
zfs-modules-0.6.0-rc14_2.6.32_279.22.1.el6.x86_64,
zfs-modules-devel-0.6.0-rc14_2.6.32_279.22.1.el6.x86_64

These are my commands:

# md
mdadm --create /dev/md2 --level 6 --raid-devices=16 /dev/disk/by-id/scsi-SATA_ST2000DM001-1CH_Z* --assume-clean --bitmap=internal

# zfs
zpool create -o ashift=12 tank raidz2 /dev/disk/by-id/scsi-SATA_ST2000DM001-1CH_Z???????

Bryn Hughes

12 years ago

...

Don't forget ZFS is doing quite a lot of extra data integrity work that
MD isn't doing. MD doesn't do file-level checksums for instance as it
has no filesystem knowledge whatsoever.

You also don't mention what filesystem you are using on MD. Are you
comparing raw MD or are you comparing MD + some other FS? The numbers
will be different again depending on what you are using.

I notice this line:

* ZFS raidz2, raidz3: 280 iops, 800 MB/s read, 400 MB/s write

Do you mean both raidz2 and raidz3 are getting the same performance? If
so that could for instance indicate you are CPU bound creating checksums
rather than servicing raw I/O.

It would be helpful to know some more details about your configuration.
If you could answer these questions it'd help:

- Are you using a filesystem on top of MD for your tests
(ext3/ext4/xfs/etc)?
- What tool are you using to measure performance?

Bryn

a***@public.gmane.org

12 years ago

ZFS raidz2, raidz3: 280 iops, 800 MB/s read, 400 MB/s write

Post by Bryn Hughes
Do you mean both raidz2 and raidz3 are getting the same performance? If
so that could for instance indicate you are CPU bound creating checksums
rather than servicing raw I/O.

very similar performance.

Post by Bryn Hughes
It would be helpful to know some more details about your configuration.
- Are you using a filesystem on top of MD for your tests
(ext3/ext4/xfs/etc)?

on top of md raid, i used LVM and ext4 for testing.

Post by Bryn Hughes
- What tool are you using to measure performance?

bonnie++, here are the actual results: http://jsbin.com/unelem/1

Jorrit Folmer

12 years ago

Post by a***@public.gmane.org
* md raid10 defaults: 1100 iops, 1200 MB/s read, 460 MB/s write
* md raid6 defaults: 850 iops, 1900 MB/s read, 380 MB/s write
* ZFS striped mirror: 550 iops, 750 MB/s read, 440 MB/s write
* ZFS raidz2, raidz3: 280 iops, 800 MB/s read, 400 MB/s write
So basically, my iops drop to single-disk level and read performance is 1/3
that of md raid. ZFS is not using dedup or compression.

Beware that bonnie++ doesn't benchmark random writes.
For a 16 disk md raid6 you'll find that it will do ~130 iops for 4k
random writes. Adding an internal bitmap like you did will even halve
the random writes: ~70, below single disk performance for a 16 disk set
because of the extra bookkeeping for each write.

If you're doing mostly read-only stuff, md raid6 would be great. For
other loads it would be less great.

For ZFS raidz2 the random iops are largely reversed: low random read
iops, highish random write iops.

Taken together, for both md raid6 and raidz2 you would look at striping
several raid(z)6 sets to get a more balanced performance profile.

However, for sequential workloads mdraid does indeed outperform ZFS at
the expense of a number of features that you may or may not need.
You'll find that md raid6 sequential write performance will improve
greatly if you adjust /sys/block/md0/md/stripe_cache_size to something
like 1024, 2048, 4096 or 8192.
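
A larger stripe cache costs RAM, which matters on a 4 GB box. The sketch below estimates that cost using the formula given in the kernel's md documentation (page size x number of disks x stripe_cache_size). The helper name stripe_cache_mib is made up for illustration, and the 4096-byte page size is an assumption (check with `getconf PAGESIZE`).

```shell
# Hypothetical helper: estimate stripe cache memory in MiB.
# Usage: stripe_cache_mib ENTRIES DISKS [PAGE_SIZE_BYTES]
stripe_cache_mib() {
    entries=$1
    disks=$2
    page=${3:-4096}   # assumed page size in bytes
    echo $(( entries * disks * page / 1048576 ))
}

stripe_cache_mib 1024 16   # prints 64
stripe_cache_mib 8192 16   # prints 512

# To apply a value (as root; md0 is the example device name used above):
#   echo 4096 > /sys/block/md0/md/stripe_cache_size
```

The setting does not persist across reboots, so it would need to be reapplied from an init script or similar.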

Jorrit Folmer

Gordan Bobic

12 years ago

...

You have omitted the most important thing - how are you testing?

Gordan

Cyril Plisko

12 years ago

Post by a***@public.gmane.org
md raid10 defaults: 1100 iops, 1200 MB/s read, 460 MB/s write
md raid6 defaults: 850 iops, 1900 MB/s read, 380 MB/s write
ZFS striped mirror: 550 iops, 750 MB/s read, 440 MB/s write
ZFS raidz2, raidz3: 280 iops, 800 MB/s read, 400 MB/s write
So basically, my iops drop to single-disk level and read performance is 1/3 that
of md raid. ZFS is not using dedup or compression.

This is to be expected. A single RAIDZx group gives you the IOPS of the
slowest disk in the group. This is the price you pay for other
interesting things, like absence of the write hole, no reconstruction
penalty on fault, etc. When a higher IOPS rate is required you may want
to use multiple RAIDZx vdevs (provided you have enough drives to
populate them).
In general (as mentioned in this thread and in numerous other places)
all the bells and whistles of ZFS come with a price tag, and this
price tag is especially visible on small setups.
Put another way: you can get really good performance if
correctness is not a requirement.
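
As a rough back-of-the-envelope sketch of that rule of thumb: if each RAIDZx vdev delivers about one disk's worth of random IOPS, the pool total scales with the number of vdevs, not the number of disks. The function name and the 100 IOPS per-disk figure below are illustrative assumptions, not measurements from this thread, and real results depend heavily on the workload and the benchmark.

```shell
# Hypothetical estimate: random IOPS ~= number of RAIDZ vdevs * IOPS of one disk.
# Usage: est_raidz_iops VDEVS PER_DISK_IOPS
est_raidz_iops() {
    vdevs=$1
    per_disk=$2
    echo $(( vdevs * per_disk ))
}

# 16 disks, assuming 100 IOPS per disk:
est_raidz_iops 1 100   # one 16-disk raidz2 -> 100
est_raidz_iops 2 100   # two 8-disk raidz2  -> 200
est_raidz_iops 4 100   # four 4-disk raidz  -> 400
```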

Matthew Robbetts

12 years ago

This is to be expected...

Hmmm, is it also to be expected that a ZFS striped mirror is so much
slower than Raid10 on IOPs and streaming reads? I would have thought
they would perform similarly at least.

Cyril Plisko

12 years ago

Post by Matthew Robbetts
Hmmm, is it also to be expected that a ZFS striped mirror is so much
slower than Raid10 on IOPs and streaming reads? I would have thought
they would perform similarly at least.

That is impossible to tell without knowing all the gory details of the
experiment that yielded these numbers. In general yes, the expectation
is that a mirror will perform similarly in both cases.

--
Regards,
Cyril

Reinis Rozitis

12 years ago

zpool create -o ashift=12 tank raidz2
/dev/disk/by-id/scsi-SATA_ST2000DM001-1CH_Z???????

Is this the real command?
I mean are you putting all 16 disks into a single vdev?

You would be better off splitting this into 2 vdevs, e.g.: zpool create -o
ashift=12 tank raidz2 [8 devices] raidz2 [8 devices]
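
One way to avoid typing the device lists by hand is to generate the command from a glob. This is only a sketch: the helper name build_zpool_cmd is made up, the pool name and ashift are copied from the command above, and it prints the command for review instead of running it.

```shell
# Hypothetical helper: print (do not run) a zpool create command that
# groups the given devices into raidz2 vdevs of PER_VDEV disks each.
# Usage: build_zpool_cmd PER_VDEV DEVICE...
build_zpool_cmd() {
    per_vdev=$1
    shift
    cmd="zpool create -o ashift=12 tank"
    i=0
    for dev in "$@"; do
        # Start a new raidz2 vdev every PER_VDEV devices.
        if [ $(( i % per_vdev )) -eq 0 ]; then
            cmd="$cmd raidz2"
        fi
        cmd="$cmd $dev"
        i=$(( i + 1 ))
    done
    printf '%s\n' "$cmd"
}

# Example with the disks from this thread (two vdevs of 8):
#   build_zpool_cmd 8 /dev/disk/by-id/scsi-SATA_ST2000DM001-1CH_Z???????
```

Check the printed command before pasting it, since zpool create on the wrong devices is destructive.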

rr

a***@public.gmane.org

12 years ago

Post by Reinis Rozitis

zpool create -o ashift=12 tank raidz2
/dev/disk/by-id/scsi-SATA_ST2000DM001-1CH_Z???????

Is this the real command?
I mean are you putting all 16 disks into a single vdev?
You would be better off splitting this into 2 vdevs, e.g.: zpool create -o
ashift=12 tank raidz2 [8 devices] raidz2 [8 devices]

I tried this 2x8-disk raidz2 layout; my results:
iops: 280, read: 700 MB/s, write: 600 MB/s

I also tried a 4x4-disk raidz:
iops: 320, read: 600 MB/s, write: 650 MB/s

I will try to use "fio" for random write tests and report here.
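
For reference, a 4k random-write fio run could look roughly like the sketch below. The helper name fio_randwrite_cmd is made up and every parameter value (job count, runtime, file size, psync engine) is an assumption to adjust; the size should exceed RAM so caching does not dominate. The function only prints the command so it can be reviewed first.

```shell
# Hypothetical helper: print a fio command line for a 4k random-write test.
# Usage: fio_randwrite_cmd TARGET_DIR [SIZE_PER_JOB]
fio_randwrite_cmd() {
    dir=$1
    size=${2:-8g}   # assumed default; pick something larger than RAM
    printf 'fio --name=randwrite --directory=%s --rw=randwrite --bs=4k --size=%s --ioengine=psync --numjobs=4 --runtime=60 --time_based --end_fsync=1 --group_reporting\n' "$dir" "$size"
}

# Print the command for a pool mounted at /tank, then run it by hand:
fio_randwrite_cmd /tank
```

Running the same command against the ext4-on-md mount point would keep the two tests comparable.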

Andreas Dilger

12 years ago

I have a new storage server with 16 disks. I built native ZFS on this CentOS 6 machine and I get consistently slower benchmarks with ZFS than with md raid. Using aligned partitions and ashift=12 doesn't help.

Are you using partitions below ZFS instead of whole disk devices? IIRC this disables some of the optimizations ZFS can do with cache flushing.

Cheers, Andreas

...