[zfs-discuss] ZFS SEND very slow, queue depth of 1?

l***@gmail.com

2015-02-23 17:24:56 UTC

It looks like zfs send is a highly serial transaction.

I have been doing some tests with ZFS send/recv. Initially I thought the
bottleneck was the 1 Gbit network interfaces. But after several tests with
SSH, SCP, MBUFFER and NETCAT, I have ruled out the network and the
transport protocol.

After carefully analysing IOSTAT results, I have noticed that ZFS send
builds the stream with no more than one operation in flight at a time. I
believe this is the bottleneck.

I am also using enterprise SSDs and those disks are not the bottleneck
either.

- Has anyone seen this kind of performance?
- Thoughts on doing a parallel tree walk so that zfs send can run as a
parallel operation?

I have been trying to figure out how to parallelize this cleanly, but
additional insights would be appreciated.

To unsubscribe from this group and stop receiving emails from it, send an email to zfs-discuss+***@zfsonlinux.org.

Prakash Surya

2015-02-23 18:02:09 UTC

There's work that we're doing at Delphix to increase the send/receive
performance. Unfortunately none of that work is yet in the open, but it
will be over the next couple months. Then it'll take some time and work
to push it upstream into illumos, and then some more time to trickle
down into ZFS on Linux.

I think the gist of the changes had to do with prefetching blocks that
are needed. IIRC, the send/receive code is a single thread that does a
read and then performs the operations that it needs on the block. To get
better usage of the backend storage, these reads can be prefetched, and
then when the single send/receive thread needs that block it should
already be cached in the ARC.
--
Cheers, Prakash
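
The prefetch idea described above can be illustrated with a small simulation. This is only a sketch and not the Delphix or OpenZFS implementation: `read_block`, `READ_LATENCY` and the `depth` parameter are hypothetical stand-ins, and a thread pool plays the role of the prefetcher. The consumer stays single-threaded and in order, but several reads are kept in flight ahead of it.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor

READ_LATENCY = 0.01  # hypothetical per-block read latency, in seconds


def read_block(block_id):
    """Stand-in for a disk read; sleeps to simulate device latency."""
    time.sleep(READ_LATENCY)
    return f"block-{block_id}".encode()


def send_serial(block_ids):
    """One read outstanding at a time: read a block, use it, repeat."""
    return [read_block(b) for b in block_ids]


def send_with_prefetch(block_ids, depth=8):
    """Single in-order consumer with up to `depth` reads in flight."""
    stream = []
    pending = deque()
    ids = iter(block_ids)
    with ThreadPoolExecutor(max_workers=depth) as pool:
        # Prime the prefetch window with the first `depth` reads.
        for b in ids:
            pending.append(pool.submit(read_block, b))
            if len(pending) >= depth:
                break
        while pending:
            # Take the oldest read first so the stream order is preserved.
            stream.append(pending.popleft().result())
            nxt = next(ids, None)
            if nxt is not None:
                pending.append(pool.submit(read_block, nxt))
    return stream


if __name__ == "__main__":
    blocks = list(range(32))
    for fn in (send_serial, send_with_prefetch):
        start = time.perf_counter()
        fn(blocks)
        print(fn.__name__, f"{time.perf_counter() - start:.3f}s")
```

Both functions produce the same stream; only the number of outstanding reads differs, which is the quantity the original poster observed as a queue depth of 1.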

l***@gmail.com

2015-02-23 18:26:49 UTC

Thanks Prakash. I have been investigating along the same lines. Tests show
that what you said is correct. The data needs to be in ARC in order to
speed up the SEND performance.

Durval Menezes

2015-02-23 18:20:09 UTC

Hello,

Not to parallelize, but to at least smooth out the serial performance, many
folks insert a buffer(1)-like program somewhere in the pipeline.

Cheers,

--
Durval.

l***@gmail.com

2015-02-23 18:28:05 UTC

Thanks Durval. I used mbuffer and it did not help. The performance is not
network or buffer related. It has to do with the way SEND builds the
stream, as Prakash noted.

l***@gmail.com

2015-02-23 19:04:20 UTC

@Prakash
There was another test that I ran. When I turned the cache on while creating
a FS and its snapshot, the SEND performance was excellent. This was
because the blocks were already in the ARC. But when I restarted the
system, the cache was empty, as expected, and the performance was lost
again.
I believe the solution you guys at Delphix are working on is great, but it
does not address the root cause, which is that SEND uses a single tree walk
and then writes to a single sector of the disk. We probably need a parallel
tree walker that uses range locks (already available in ZFS), gets the
blocks in parallel, and writes them to disk in parallel too.

Thoughts?

Kash Pande

2015-02-23 21:50:14 UTC

Although there is no way around the initial send time without reworking the
underlying send code, if you are sending identical data to
multiple hosts, you could do the following:

* one 'zfs send' per filesystem piped through pigz (gzip) or lz4c, save
the file to local storage
* md5sum on generated file
* As the files become available, send some sort of 'trigger' to your
recv hosts to rsync, verify, and recv that dataset in the background. (I
use a PHP script on each host with lighttpd that lets the sending side
tell the recv side to copy this file from a read-only NFS mount, copy it
via SSH or even netcat to local storage - whatever the user has configured.)
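
The verify step in the list above could look roughly like the following on the receiving side. This is a sketch under assumptions: the helper names `md5_of_file` and `stream_is_intact` are hypothetical, and the expected digest is assumed to arrive alongside the file, for example as the first field of the sender's md5sum output.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a large stream is never loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stream_is_intact(path, expected_md5):
    """Compare the local copy against the digest computed on the sender."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Only when `stream_is_intact` returns True would the receiving host go on to decompress the file and pipe it into zfs recv.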

I have iSCSI servers storing Windows and other desktop images, with
updates coming from a single master server. I may have 5 or more servers
receiving the same data at once. Because the file is stored before it is
transferred over the network, the data is held in ARC and you can
achieve transfer speeds in the multi-hundred MB/s range. Using this
method of replication I can reduce transfer time from hours to just 30
minutes or 1 hour, which is significant.

Kash Pande

Prakash Surya

2015-02-24 14:57:19 UTC

I'm not an expert in the send/receive code, but I think it'd need a
significant overhaul to make the two operations parallelizable. It's
written in a way that expects the dataset to be traversed linearly
in both the send and receive code, IIRC. So, as a less invasive change,
properly leveraging prefetch is how we decided to address the issue. I'd
have to defer to some of the other ZFS folks at Delphix for the details.
--
Cheers, Prakash

Gordan Bobic

2015-02-24 15:00:40 UTC

It is also not clear if this applies to the whole data set or only the
initial part. I find that the scrub tends to be fairly slow to begin with
(first 5-10 minutes) while the metadata is traversed, but after that it
speeds up dramatically.

Isaac Huang

2015-02-24 19:58:09 UTC

Looks like Oracle implemented "Sequential Resilvering":
https://docs.oracle.com/cd/E51475_01/html/E52874/maintenance__system__updates__sequential_resilvering___sequential_resilvering_.html

Maybe some ideas can be taken from there.

-Isaac

l***@gmail.com

2015-02-24 20:39:37 UTC

That is good to know. But this discussion is focused on the 'zfs send'
performance and I am not sure resilvering is related to that.

Prakash Surya

2015-02-24 21:07:15 UTC

Unrelated to the OP's issue, but that's interesting to hear, and
definitely something we could do in OpenZFS. It'd just take somebody
with the necessary skills to have enough interest to do that work.
--
Cheers, Prakash

Andreas Dilger

2015-02-25 03:57:20 UTC

Possibly this is based on the design we did a few years ago for improved ZFS scrub/resilver performance by sorting the blocks:

http://wiki.lustre.org/images/f/ff/Rebuild_performance-2009-06-15.pdf

Cheers, Andreas

l***@gmail.com

2015-02-26 16:28:08 UTC

@Gordan Bobic
@smt
If you examine the ZFS send code, you can see the root cause of the lack of
performance. The SEND code reads the blocks from disk and builds the
stream, all in one thread. As @Prakash said, prefetching the blocks is one
way to alleviate the problem. The other way would be to have separate
threads do the read and the build.
I am not certain why you are not seeing these performance numbers, but from
the code it is clear that they should be seen.

Gordan Bobic

2015-02-26 16:31:02 UTC

Whether the performance issue is going to be observable sounds like it is
fundamentally determined by two factors:
1) Data fragmentation and seek time
2) Interconnect speed

If the data isn't very fragmented, saturating a gigabit link isn't going to
be difficult even if it is running single-threaded.
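
The interplay of the two factors can be made concrete with a back-of-the-envelope bound. The sketch below assumes exactly one read outstanding at a time and ignores CPU time spent building the stream; the block sizes and latencies in the loop are hypothetical illustrations and not measurements from this thread.

```python
def serial_throughput_mb_s(block_kib, latency_ms):
    """Upper bound on throughput with exactly one read outstanding at a time."""
    bytes_per_read = block_kib * 1024
    reads_per_s = 1000.0 / latency_ms
    return bytes_per_read * reads_per_s / 1e6


GIGABIT_MB_S = 125.0  # 1 Gbit/s is 125 MB/s before protocol overhead

# (block size in KiB, per-read latency in ms); hypothetical values
for block_kib, latency_ms in [(128, 0.2), (128, 8.0), (8, 0.2)]:
    rate = serial_throughput_mb_s(block_kib, latency_ms)
    print(f"{block_kib:>4} KiB blocks, {latency_ms} ms/read: {rate:7.1f} MB/s, "
          f"saturates gigabit: {rate >= GIGABIT_MB_S}")
```

Under these assumptions large blocks on a low-latency device exceed the gigabit limit even when read serially, while small blocks or seek-bound reads fall well short of it.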

Gordan Bobic

2015-02-24 09:17:53 UTC

Post by l***@gmail.com
It looks like zfs send is a highly serial transaction.
I have been doing some tests with ZFS send/recv. Initially I thought the
bottleneck was the 1GIG network interfaces. But after several tests with
SSH, SCP, MBUFFER and NETCAT, I have ruled out network or transport
protocol errors.

Have you verified with:
zfs send mypool/***@mysnap | pv > /dev/null
?

I have never not managed to saturate a gigabit connection with zfs send.

Post by l***@gmail.com
I am also using enterprise SSDs and those disks are not the bottleneck
either.

So
iostat -x 1
isn't showing them as upward of 80% busy?

AndCycle

2015-02-24 10:08:58 UTC

I have a similar experience with current ZoL 0.6.3:
zfs send from an Intel SSD 730 gives me very low throughput, and the
device utilization is really low.
There is definitely a performance issue somewhere.

zfs send zroot/prod/lz4/default/***@zfs-auto-snap_hourly-2015-02-24-1000 |
pv -trab > /dev/null
8228MiB 0:00:49 [9.71MiB/s] [ 6.7MiB/s]

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 1117.00 208.00 14044.00 9708.00 35.85 0.24 0.18 0.16 0.29 0.11 14.00
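
As a cross-check on the iostat sample above, Little's law (mean requests in flight = request rate times mean time per request) can be applied to the r/s, w/s and await columns. This is a sketch: the helper name `queue_depth` is made up for illustration, and the inputs are the rounded values printed by iostat, so the result is approximate.

```python
def queue_depth(reads_per_s, writes_per_s, await_ms):
    """Little's law: mean requests in flight = request rate * mean time per request."""
    return (reads_per_s + writes_per_s) * (await_ms / 1000.0)


# Values copied from the sda line of the iostat sample.
qd = queue_depth(1117.00, 208.00, 0.18)
read_mib_s = 14044.00 / 1024  # the rkB/s column converted to MiB/s

print(f"estimated queue depth: {qd:.2f}")
print(f"read throughput: {read_mib_s:.1f} MiB/s")
```

The estimate agrees with the avgqu-sz column (0.24): on average the device had well under one request outstanding during this sample.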

AndCycle

2015-02-24 10:16:56 UTC

There is also a performance issue with scrub:
it takes 16 hours to scrub a 480G Intel SSD 730.
An iometer test shows that at a queue depth of 1 the 730 delivers around
6000 IOPS; 480G divided by 16 hours gives about 8000 IOPS, which suggests
the queue depth hardly goes over 1.
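
The relation between scrub time, throughput and IOPS in the figures above can be written out explicitly. The sketch assumes the full 480G was scrubbed and reads "480G" as decimal gigabytes; the post does not state the average I/O size, so the code only shows which I/O size a given IOPS figure would imply.

```python
capacity_bytes = 480e9   # 480G as decimal gigabytes, disk assumed full (assumption)
duration_s = 16 * 3600   # 16 hours

throughput = capacity_bytes / duration_s  # average bytes per second


def implied_io_size(iops):
    """Average bytes per I/O needed to reach `throughput` at a given IOPS rate."""
    return throughput / iops


print(f"average scrub throughput: {throughput / 1e6:.2f} MB/s")
for iops in (6000, 8000):
    print(f"{iops} IOPS implies about {implied_io_size(iops):.0f} bytes per I/O")
```

Whether these rates match the device's queue-depth-1 behaviour depends on the real I/O size, which would have to be taken from iostat during the scrub.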

l***@gmail.com

2015-02-24 14:19:07 UTC

@Gordan Bobic

Yes, I have verified by redirecting to /dev/null. The transfer speeds
were exactly the same (11 MB/sec over the gigabit link).

iostat shows that the disks are waiting for I/O. They can handle a lot more
reads and writes, but while send is running they sit almost idle at
times. My analysis showed that ZFS send had a queue depth of 1.

Steve Thompson

2015-02-24 20:47:08 UTC

Post by Gordan Bobic
I have never not managed to saturate a gigabit connection with zfs send.

Likewise. I have a 78 TB pool of 39 mirrored pairs (each 2 TB nearline
SAS) sending to a remote system over 1 GBE to a backup 95 TB pool of all
SATA disks (using several RAIDZ2 vdevs), and typically see 95-99 MB/sec
send performance.

Steve
