Discussion:
ZFS SEND very slow, queue depth of 1?
l***@gmail.com
2015-02-23 17:24:56 UTC
It looks like zfs send is a highly serial operation.

I have been doing some tests with zfs send/recv. Initially I thought the
bottleneck was the 1 Gbit network interfaces, but after several tests with
ssh, scp, mbuffer and netcat, I have ruled out the network and the
transport protocol as the cause.

After carefully analysing iostat results, I have noticed that zfs send
builds the stream with no more than one I/O operation in flight at a time
(a queue depth of 1). I believe this is the bottleneck.

I am also using enterprise SSDs, and those disks are not the bottleneck
either.

- Has anyone seen this kind of performance?
- Thoughts on doing a parallel tree walk so that zfs send can run as a
parallel operation?

I have been trying to figure out how to parallelize this cleanly, but
additional insights would be appreciated.


To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.
Prakash Surya
2015-02-23 18:02:09 UTC
There's work that we're doing at Delphix to increase the send/receive
performance. Unfortunately none of that work is yet in the open, but it
will be over the next couple months. Then it'll take some time and work
to push it upstream into illumos, and then some more time to trickle
down into ZFS on Linux.

I think the gist of the changes had to do with prefetching blocks that
are needed. IIRC, the send/receive code is a single thread that does a
read and then performs the operations that it needs on the block. To get
better usage of the backend storage, these reads can be prefetched, and
then when the single send/receive thread needs that block it should
already be cached in the ARC.
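
As a rough illustration of the prefetch idea (not the Delphix or OpenZFS code), the sketch below keeps a single consumer that walks block IDs strictly in order, while a small thread pool issues reads ahead of it. `read_block`, the window size and the simulated latency are hypothetical stand-ins chosen for the example.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def read_block(block_id, latency=0.005):
    """Hypothetical stand-in for one disk read; sleeps to simulate latency."""
    time.sleep(latency)
    return b"block-%d" % block_id


def send_serial(block_ids):
    """One read in flight at a time, like a queue depth of 1."""
    return [read_block(b) for b in block_ids]


def send_with_prefetch(block_ids, window=8):
    """Consume blocks in order while up to `window` reads are in flight."""
    out = []
    pending = deque()
    ids = iter(block_ids)
    with ThreadPoolExecutor(max_workers=window) as pool:
        # Prime the window with the first reads.
        for b in ids:
            pending.append(pool.submit(read_block, b))
            if len(pending) >= window:
                break
        while pending:
            # The consumer still sees blocks strictly in traversal order.
            out.append(pending.popleft().result())
            nxt = next(ids, None)
            if nxt is not None:
                pending.append(pool.submit(read_block, nxt))
    return out
```

The consumer stays single-threaded and linear, which is why this is less invasive than a parallel tree walk: only the reads overlap.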
--
Cheers, Prakash
l***@gmail.com
2015-02-23 18:26:49 UTC
Thanks Prakash. I have been investigating along the same lines. Tests show
that what you said is correct. The data needs to be in ARC in order to
speed up the SEND performance.

Durval Menezes
2015-02-23 18:20:09 UTC
Hello,
Post by l***@gmail.com
- Thoughts on doing a parallel tree walk to facilitate the zfs send as a
parallel operation
I have been trying to figure out how to parallelize this cleanly but
additional insights would be appreciated,
Not to parallelize, but to at least smooth out the serial performance, many
folks insert a buffer(1)-like program somewhere in the pipeline.

Cheers,
--
Durval.
l***@gmail.com
2015-02-23 18:28:05 UTC
Thanks Durval. I used mbuffer and it did not help. The performance problem
is not network or buffer related. It has to do with the way zfs send builds
the stream, as Prakash noted.

l***@gmail.com
2015-02-23 19:04:20 UTC
@Prakash
There was another test that I ran. When I turned the cache on while creating
a filesystem and its snapshot, the send performance was excellent. This was
because the blocks were already in the ARC. But when I restarted the
system, the cache was empty, as expected, and the performance was lost
again.
I believe the solution you guys at Delphix are working on is great, but it
does not address the root cause, which is that send uses a single tree walk
and then writes to a single sector of the disk. We probably need a parallel
tree walker that uses range locks (already available in ZFS), gets the
blocks in parallel, and writes them to disk in parallel too.

Thoughts?


Kash Pande
2015-02-23 21:50:14 UTC
Although there is no way around the initial send time without reworking the
underlying send code, if you are sending identical data to multiple hosts,
you could do the following:

* one 'zfs send' per filesystem piped through pigz (gzip) or lz4c, save
the file to local storage
* md5sum on generated file
* As the files become available, send some sort of 'trigger' to your
recv hosts to rsync, verify, and recv that dataset in the background. (I
use a PHP script on each host with lighttpd that lets the sending side
tell the recv side to copy this file from a read-only NFS mount, copy it
via SSH or even netcat to local storage - whatever the user has configured.)

I have iSCSI servers storing Windows and other desktop images, with
updates coming from a single master server. I may have 5 or more servers
receiving the same data at once. Because the file is stored locally before
it is transferred over the network, its data is in the ARC and you can
achieve transfer speeds in the multi-hundred MB/s range. Using this
method of replication I can reduce transfer time from hours to between 30
minutes and 1 hour, which is significant.
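
The staging steps could be scripted roughly as follows. This is a sketch under assumptions, not the PHP/lighttpd setup described: the function names are hypothetical, `pigz` is assumed to be on the PATH, and the "trigger" step is left out because it is site-specific.

```python
import hashlib
import shlex


def build_stage_command(snapshot, out_path):
    """Shell pipeline that dumps one snapshot, compressed, to local storage."""
    return "zfs send {} | pigz > {}".format(
        shlex.quote(snapshot), shlex.quote(out_path)
    )


def md5_of_file(path, chunk_size=1 << 20):
    """Checksum the staged file in chunks so large streams fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_staged_file(path, expected_md5):
    """Receiver-side check before feeding the file to `zfs recv`."""
    return md5_of_file(path) == expected_md5
```

The sender would run the pipeline once per filesystem and publish the checksum; each receiver copies the file, calls something like `verify_staged_file`, and only then decompresses it into `zfs recv`.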


Kash Pande
Prakash Surya
2015-02-24 14:57:19 UTC
I'm not an expert in the send/receive code, but I think it would need a
significant overhaul to make the two operations parallelizable. It's
written in a way that expects the dataset to be traversed linearly
in both the send and receive code, IIRC. So, as a less invasive change,
properly leveraging prefetch is how we decided to address the issue. I'd
have to defer to some of the other ZFS folks at Delphix for the details.
--
Cheers, Prakash
Gordan Bobic
2015-02-24 15:00:40 UTC
It is also not clear if this applies to the whole data set or only the
initial part. I find that the scrub tends to be fairly slow to begin with
(first 5-10 minutes) while the metadata is traversed, but after that it
speeds up dramatically.
Isaac Huang
2015-02-24 19:58:09 UTC
Looks like Oracle implemented "Sequential Resilvering":
https://docs.oracle.com/cd/E51475_01/html/E52874/maintenance__system__updates__sequential_resilvering___sequential_resilvering_.html

Maybe some ideas can be taken from there.

-Isaac
l***@gmail.com
2015-02-24 20:39:37 UTC
That is good to know. But this discussion is focused on the 'zfs send'
performance and I am not sure resilvering is related to that.


Prakash Surya
2015-02-24 21:07:15 UTC
Unrelated to the OP's issue, but that's interesting to hear, and
definitely something we could do in OpenZFS. It'd just take somebody
with the necessary skills to have enough interest to do that work.
--
Cheers, Prakash
Andreas Dilger
2015-02-25 03:57:20 UTC
Possibly this is based on the design we did a few years ago for improved ZFS scrub/resilver performance by sorting the blocks:

http://wiki.lustre.org/images/f/ff/Rebuild_performance-2009-06-15.pdf
Cheers, Andreas





l***@gmail.com
2015-02-26 16:28:08 UTC
@Gordan Bobic
@smt
If you examine the zfs send code, you can see the root cause of the poor
performance. The send code reads the blocks from disk and builds the
stream, all in one thread. As @Prakash said, prefetching the blocks is one
way to alleviate the problem. The other way would be to have separate
threads do the read and the build.
I am not certain why you are not seeing these performance numbers, but from
the code it is clear that they should be seen.
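
A sketch of the "separate threads" idea, assuming hypothetical `read_block` and `build_record` helpers; it is not how zfs send is structured. One thread reads blocks in traversal order into a bounded queue while the calling thread builds the stream, so reading block N+1 overlaps with building record N.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the traversal


def read_block(block_id):
    """Hypothetical stand-in for reading one block from disk."""
    return b"data-%d" % block_id


def build_record(block):
    """Hypothetical stand-in for turning a block into a stream record."""
    return len(block).to_bytes(4, "big") + block


def reader(block_ids, q):
    """Read blocks in order and hand them to the builder."""
    for b in block_ids:
        q.put(read_block(b))  # blocks when the queue is full
    q.put(_DONE)


def send_pipelined(block_ids, depth=4):
    """Overlap reads with record building using a bounded queue."""
    q = queue.Queue(maxsize=depth)
    t = threading.Thread(target=reader, args=(block_ids, q))
    t.start()
    stream = []
    while True:
        block = q.get()
        if block is _DONE:
            break
        stream.append(build_record(block))
    t.join()
    return b"".join(stream)
```

With a single reader thread the disk still sees one read at a time, so this only hides the cost of building the stream. Raising the queue depth at the device would need several readers or prefetching.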

Gordan Bobic
2015-02-26 16:31:02 UTC
Whether the performance issue is observable sounds like it is
fundamentally determined by two factors:
1) Data fragmentation and seek time
2) Interconnect speed

If the data isn't very fragmented, saturating a gigabit link isn't going to
be difficult even if zfs send is running single-threaded.
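
To make the two factors concrete, here is a back-of-the-envelope model. The record size and per-read latencies are illustrative assumptions, not measurements from this thread: a single-threaded sender that waits for each read can move at most one record per read latency, and the result is then capped by the link.

```python
GIGABIT_MB_S = 125.0  # 1 Gbit/s expressed in MB/s, ignoring protocol overhead


def serial_throughput_mb_s(record_kib, read_latency_ms):
    """Upper bound for one outstanding read: record size / latency."""
    bytes_per_s = record_kib * 1024 / (read_latency_ms / 1000.0)
    return bytes_per_s / 1e6


def observed_mb_s(record_kib, read_latency_ms, link_mb_s=GIGABIT_MB_S):
    """The slower of the serial read rate and the interconnect wins."""
    return min(serial_throughput_mb_s(record_kib, read_latency_ms), link_mb_s)


# Assumed cases: 128 KiB records with an 8 ms seek per read (fragmented
# spinning disk) versus 0.5 ms per read (mostly sequential data or SSD).
fragmented = observed_mb_s(128, 8.0)
sequential = observed_mb_s(128, 0.5)
```

Under these assumptions the fragmented case lands near 16 MB/s while the sequential case is limited by the link, so the same single-threaded code can look either slow or perfectly adequate depending on the pool.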
Gordan Bobic
2015-02-24 09:17:53 UTC
Post by l***@gmail.com
It looks like zfs send is a highly serial transaction.
I have been doing some tests with ZFS send/recv. Initially I thought the
bottleneck was the 1GIG network interfaces. But after several tests with
SSH, SCP, MBUFFER and NETCAT, I have ruled out network or transport
protocol errors.
Have you verified with:
zfs send mypool/***@mysnap | pv > /dev/null
?

I have never not managed to saturate a gigabit connection with zfs send.

Post by l***@gmail.com
I am also using enterprise SSDs and those disks are not the bottleneck
either.
So
iostat -x 1
isn't showing them as upward of 80% busy?

AndCycle
2015-02-24 10:08:58 UTC
I have a similar experience with current ZoL 0.6.3:
zfs send from an Intel SSD 730 gives me very low throughput, and the
device utilization is really low.
There is definitely a performance issue somewhere.

zfs send zroot/prod/lz4/default/***@zfs-auto-snap_hourly-2015-02-24-1000 |
pv -trab > /dev/null
8228MiB 0:00:49 [9.71MiB/s] [ 6.7MiB/s]

Device:  rrqm/s  wrqm/s      r/s     w/s     rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0.00    0.00  1117.00  208.00  14044.00  9708.00     35.85      0.24   0.18     0.16     0.29   0.11  14.00
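
One way to read the iostat line above is through Little's law (average requests in flight = request rate × average time per request). The short calculation below uses only the figures shown; the variable names are mine, and it assumes iostat's usual units (kB of 1024 bytes, request size in 512-byte sectors).

```python
# Figures copied from the iostat line for sda above.
reads_per_s = 1117.00
writes_per_s = 208.00
read_kb_per_s = 14044.00
write_kb_per_s = 9708.00
await_ms = 0.18  # average time per request, queueing included

iops = reads_per_s + writes_per_s

# Little's law: average requests in flight = rate * time per request.
avg_in_flight = iops * (await_ms / 1000.0)

# Average request size, in KB and in 512-byte sectors (iostat's avgrq-sz unit).
avg_request_kb = (read_kb_per_s + write_kb_per_s) / iops
avg_request_sectors = avg_request_kb * 1024 / 512

print(round(avg_in_flight, 2))        # compare with avgqu-sz
print(round(avg_request_sectors, 2))  # compare with avgrq-sz
```

Both values agree with the reported avgqu-sz (0.24) and avgrq-sz (35.85): on average, well under one request was outstanding on the device during the send.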
AndCycle
2015-02-24 10:16:56 UTC
And there is also a performance issue with scrub:
it takes 16 hours to scrub a 480 GB Intel SSD 730.
An Iometer test shows that at a queue depth of 1 the 730 delivers around
6000 IOPS. 480 GB divided by 16 hours gives about 8000 IOPS, which shows
that the queue depth hardly goes over 1.
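
The scrub arithmetic can be written out as below. The average I/O sizes are assumptions for illustration, since the post does not state the block size behind the IOPS estimate; decimal gigabytes are also assumed.

```python
capacity_bytes = 480e9      # 480 GB, assuming decimal gigabytes
duration_s = 16 * 3600      # 16 hours

throughput_mb_s = capacity_bytes / duration_s / 1e6


def implied_iops(avg_io_bytes):
    """IOPS needed to sustain the average scrub throughput at this I/O size."""
    return capacity_bytes / duration_s / avg_io_bytes


# Assumed average I/O sizes; the real mix depends on the pool's block sizes.
for size in (1024, 4096, 131072):
    print(size, round(implied_iops(size)))
```

The average works out to roughly 8.3 MB/s. An estimate near 8000 IOPS corresponds to an average I/O of about 1 KiB, and larger blocks imply proportionally fewer operations.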
l***@gmail.com
2015-02-24 14:19:07 UTC
@Gordan Bobic

Yes, I have verified by redirecting to /dev/null. The transfer speeds
were exactly the same (11 MB/sec over the gigabit link).

iostat shows that the disks are waiting for I/O. They can handle a lot more
writes and reads, but while send is running they are almost sitting idle at
times. My analysis showed that zfs send had a queue depth of 1.


Steve Thompson
2015-02-24 20:47:08 UTC
Post by Gordan Bobic
I have never not managed to saturate a gigabit connection with zfs send.
Likewise. I have a 78 TB pool of 39 mirrored pairs (each 2 TB nearline
SAS) sending to a remote system over 1 GbE to a backup 95 TB pool of all
SATA disks (using several RAIDZ2 vdevs), and typically see 95-99 MB/sec
send performance.

Steve