Discussion:
Understanding ZFS snapshots at the block level within a file
Lists
2014-07-02 01:16:00 UTC
If you save a file to a ZFS file system, and then overwrite the entire
file with very similar content, does
ZFS reallocate a new set of blocks for the whole file, or is it "smart
enough" to write only the changes in the
file being rewritten over?

I'm trying to optimize our database backup scheme using ZFS, and knowing
exactly what ZFS is doing
might make for a dramatic improvement in space utilization.

We do PostgreSQL database dumps the usual way with pg_dump. The dumps
are performed via cron scripts that run
hourly. We currently keep backups of these for 24 hours, then we keep
the 10 PM backup in an archival form pretty
much forever. Nightly, a script makes a tar of the hourly dumps onto external
media and then rsyncs the files offsite.

However, the hourly DB dumps take up a *lot* of space. It's sad, really,
because the data is both highly compressible
and highly duplicative. There is almost never even a 1% change from one
hour to the next. We once experimented with
using diff to keep only the changes from one dump to the next, but that
was far too processor intensive to be feasible.

### what we do now
E.g.:
$ pg_dump -U $dbuser $database -h $dbhost > /path/to/backups/$database.$hour.pgsql
# 10 pm
$ tar -zcf /path/to/archive/$database.$date.pgsql.tgz /path/to/backups/$database.05.pgsql
$ rsync -vaz --delete /path/to/archive/ ***@archiveserver:/path/to/archive/
### end what we do now

Consider the following scenario. If ZFS is smart enough to note block-level
changes within a single file being
overwritten, then this might be extremely space efficient, since only
the changes from one hour to the
next would be allocated and written by ZFS. In this view, ZFS
would essentially be doing a diff and writing
only the changes into the new snapshot.

### scenario
$ pg_dump -U $dbuser $database -h $dbhost > /zpool/dbbackups/$database.pgsql
$ zfs snapshot zpool/dbbackups@$hour
# ... 1 hour later
$ pg_dump -U $dbuser $database -h $dbhost > /zpool/dbbackups/$database.pgsql
$ zfs snapshot zpool/dbbackups@$hour
# ... repeat ...
### end scenario

The money question, I guess, is: in this scenario, would ZFS store 24
copies of the database dump in its
snapshots, or 24 sets of changes to a single database dump file? Note
that the file name within the file system
is the same; the zfs snapshots would provide the ability to see which
hour of backup is available.

Thanks,

Ben

Fajar A. Nugraha
2014-07-02 03:42:04 UTC
Post by Lists
If you save a file to a ZFS file system, and then overwrite the entire
file with very similar content, does
ZFS reallocate a new set of blocks for the whole file, or is it "smart
enough" to write only the changes in the
file being rewritten over?
The first one.
Post by Lists
I'm trying to optimize our database backup scheme using ZFS, and knowing
exactly what ZFS is doing
might make for dramatic improvement in space utilization.
We do Postgresql database dumps the usual way with pg_dump. The dumps are
performed via cron scripts that run
hourly. We currently keep backups of these for 24 hours, then we keep the
10 PM backup in an archival form pretty
much forever. Nightly a script makes a tar of the hourly onto external
media, and then rsync the files offsite.
However, the hourly DB dumps take up a *lot* of space. It's sad, really,
because the data is both highly compressible
and highly duplicative. There is almost never even a 1% change from one
hour to the next. We once experimented with
using diff to keep only the changes from one dump to the next, but that
was far too processor intensive to be feasible.
One way is to use zfs snapshot + rsync --inplace. Note the importance of
"--inplace".

I'm not sure how well it handles a db dump, though. For example, what happens if
content (e.g. the dump of a table) is shifted several MB further into the file by
an insertion (e.g. several hundred rows added to a previously-dumped
table)? My guess is "not so well".
Post by Lists
### what we do now
$ pg_dump -U $dbuser $database -h $dbhost > /path/to/backups/$database.$hour.pgsql
# 10 pm
$ tar -zcf /path/to/archive/$database.$date.pgsql.tgz /path/to/backups/$database.05.pgsql
$ rsync -vaz --delete /path/to/archive/ ***@archiveserver:/path/to/archive/
### end what we do now
Reference following scenario. If ZFS is smart enough to note block level
changes within a single file being
overwritten, then this might be extremely space efficient, since only the
changes from one hour to the
next are being allocated and written to by ZFS. In this view, ZFS
essentially would be doing a diff and writing
the changes in the new snapshot.
### scenario
$ pg_dump -U $dbuser $database -h $dbhost > /zpool/dbbackups/$database.pgsql
$ zfs snapshot zpool/dbbackups@$hour
# ... 1 hour later
$ pg_dump -U $dbuser $database -h $dbhost > /zpool/dbbackups/$database.pgsql
$ zfs snapshot zpool/dbbackups@$hour
# ... repeat ...
### end scenario
The money question I guess is: In this scenario, would ZFS store 24 copies
of the database dump in its
snapshots, or 24 sets of changes to a single database dump file? Note that
the file name within the file system
is the same, the zfs snapshots would provide the ability to see what hour
of backup is available.
Are postgres's files located on zfs as well? It would be MUCH easier if you
just snapshot the actual db files (instead of the dump), and then copy the
delta offsite (e.g. "zfs send -I ..."), either manually or using something
like simplesnap.
--
Fajar

Kenneth Henderick
2014-07-02 05:13:00 UTC
I think that might indeed be the main issue with snapshotting dumps with
ZFS: while the change is relatively small, it can shift everything and so
change the whole data pattern of the dump.

However, I think you shouldn't combine an old-fashioned pg_dump with the
more modern filesystem snapshot approach. You could just take ZFS snapshots
of the live, running database if you make sure the write-ahead logs
(inside the pg_xlog directory) are on the same filesystem. A restore from
such a backup basically uses the same approach as crash recovery: it
takes the last correct checkpoint in the inconsistent data structure, and
then starts replaying the logs to get to a consistent state.

More info can be found at:
http://www.postgresql.org/docs/9.1/static/continuous-archiving.html
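As a minimal sketch (the dataset name and schedule are made up, and it assumes the
data directory and pg_xlog share one dataset):

$ zfs snapshot tank/pgdata@hourly-$(date +%Y%m%d%H)
# restoring = clone (or roll back to) a snapshot, point postgres at it, and let
# normal crash recovery replay the WAL on startup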
Lists
2014-07-02 06:18:29 UTC
Post by Fajar A. Nugraha
Are postgres's files located on zfs as well? It would be MUCH easier if
you just snapshot the actual db files (instead of the dump), and then
copy the delta offsite (e.g. "zfs send -I ..."), either manually or
using something like simplesnap.
In order to do this and have a clean snapshot, we'd have to shut down
Postgres, make the snapshot, and then start it back up. Even that few seconds of
downtime every hour is mostly untenable. Is there a way to get what I
want without using diff? Perhaps an rsync option?

Fajar A. Nugraha
2014-07-02 07:12:35 UTC
Post by Lists
Post by Fajar A. Nugraha
Are postgres's files located on zfs as well? It would be MUCH easier if you
just snapshot the actual db files (instead of the dump), and then copy the
delta offsite (e.g. "zfs send -I ..."), either manually or using something
like simplesnap.
in order to do this and have a clean snapshot, we'd have to shut down
Postgres, make the snapshot, and then start. Even that few seconds of
downtime every hour is mostly untenable. Is there a way to get what I want
without using diff? Perhaps an rsync option?
Most DBs (e.g. Oracle, MySQL) don't need a clean snapshot. They can
work with an unclean snapshot IF:
- all files are snapshotted at the same time. They can be on the same
zfs dataset, or on different datasets snapshotted with "zfs snapshot
-r"
- the db has some kind of transaction log
- the db can use the transaction log to replay/roll back currently
running transactions and recover the db back to a consistent state during
the next startup

The recovery process should be similar to what would happen if there
is a power outage. AFAIK postgres should also support that, provided
it is set up correctly.
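Roughly (dataset names here are made up):

$ zfs snapshot -r tank/postgres@$(date +%Y%m%d%H)
# one recursive snapshot covers the data and WAL datasets at the same instant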

Any db that can't recover from a power outage should be thrown away, IMHO.
--
Fajar

Andrew Holway
2014-07-02 07:17:16 UTC
Can you use the pg_start_backup function to quiesce the database to get a
consistent snapshot?

http://www.postgresql.org/docs/9.3/static/functions-admin.html

There should be no need for SQL dumps in 2014 :)

Cheers,

Andrew
Andrew Holway
2014-07-02 07:18:42 UTC
I meant to post this snippet that I found also:

psql -c "select pg_start_backup('whatever');" && zfs snapshot
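The complete form is probably something along these lines (the dataset name is made
up, and pg_stop_backup should be called once the snapshot is done):

$ psql -c "SELECT pg_start_backup('hourly');" && \
      zfs snapshot tank/pgdata@hourly-$(date +%H) && \
      psql -c "SELECT pg_stop_backup();"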
Fajar A. Nugraha
2014-07-02 08:17:22 UTC
From what I read on
http://www.postgresql.org/docs/9.3/static/continuous-archiving.html,
you'd still end up with a log replay anyway. And pg_start_backup is
designed for "normal", non-snapshot copy operation (e.g. backup using
"tar").

Since you have zfs snapshot, and postgres can recover from
crash-caused-by-power-outage, you shouldn't need pg_start_backup.
Again, this is assuming that all postgres files are on the same fs, or
can be snapshotted using "-r", which (from crash recovery point of
view) makes it the same as a power outage.
--
Fajar
Tim Chase
2014-07-02 18:33:26 UTC
Post by Fajar A. Nugraha
Most DBs (e.g. Oracle, MySQL) don't need a clean snapshot. They can
work with an unclean snapshot IF:
- all files are snapshotted at the same time, either on the same zfs
dataset or on different datasets snapshotted with "zfs snapshot
-r"
PostgreSQL works just fine with filesystem-level snapshots so long as
they're performed atomically across all filesystems on which the cluster
is stored. This is spelled out fairly clearly in section 24.2 of the
manual. The interesting text starts in the paragraph beginning with "An
alternative file-system backup".

I'll note that as of 0.6.3, zfsonlinux does support atomically
snapshotting multiple filesystems so long as they're in the same pool
(this is a generalization of the "-r" functionality and came along with
the port of libzfs_core from illumos).
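For example (dataset names made up), a single command such as

$ zfs snapshot tank/pgdata@now tank/pgwal@now

should create both snapshots atomically, as long as both datasets live in the same pool.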

- Tim

Lists
2014-07-02 19:32:22 UTC
Post by Tim Chase
PostgreSQL works just fine with filesystem-level snapshots so long as
they're performed atomically across all filesystems on which the
cluster is stored. This is spelled out fairly clearly in section 24.2
of the manual. The interesting text starts in the paragraph beginning
with "An alternative file-system backup".
I would agree that PG and filesystem level snapshots would work, but
this would have a couple of disadvantages:

1) Running PG on top of ZFS directly would incur a significant
performance hit. Currently, we're running on EXT4 partitions on
Enterprise SSDs and use gobs and gobs of RAM for maximum performance.

2) We use the DB dumps in order to feed our dev environment. This lets
us test with real data. Doing this at the filesystem level (e.g.
/var/lib/pgsql on ZFS) means that we're replicating the whole cluster,
not just a single customer's database - in our app, each customer has
their own database. In our production environment, we have a total of 5
production DB servers that replicate to a hot standby. When working on a
bug report, we load that customer's database into a dev environment DB
server so that we can reproduce the problem exactly. Doing this at the
filesystem level would be highly disruptive for us.

Turning the question on its ear: is there *any* file system that would
only write the changed blocks in a similar file? If so, we could export
a zvol and format it with that file system...

Fajar A. Nugraha
2014-07-03 04:23:58 UTC
Post by Tim Chase
PostgreSQL works just fine with filesystem-level snapshots so long as
they're performed atomically across all filesystems on which the cluster is
stored. This is spelled out fairly clearly in section 24.2 of the manual.
The interesting text starts in the paragraph beginning with "An alternative
file-system backup".
Post by Lists
I would agree that PG and filesystem level snapshots would work, but this
would have a couple of disadvantages:
1) Running PG on top of ZFS directly would have a significant performance
hit. Currently, we're running on EXT4 partitions on Enterprise SSDs and use
gobs and gobs of RAM for maximum performance.
Is this your tested result, or a prediction?

While zfs might introduce its own overhead, the added features might be
worth it. And sometimes (depending on the workload) adding more RAM
will help.
2) We use the DB dumps in order to feed our dev environment. This lets us
test with real data. Doing this at the filesystem level (EG: /var/lib/pgsql
on ZFS) means that we're replicating the whole cluster, not just a single
customer's database - in our app, each customer has their own database. In
our production environment, we have a total of 5 production DB servers that
replicate to a hot standby. When treating a bug report, we load that
customer's database into a dev environment DB server so that we can
reproduce the problem exactly. Doing this at the filesystem level would be
highly disruptive for us.
So you already have a hot standby? Just one? Here's an idea (rough commands below):
- put the standby on top of zfs. Or, if you have the resources, create
another replicated system on zfs functioning as a warm standby
- perform regular snapshots (e.g. zfs-auto-snapshot)
- to handle a bug report, you:
--> clone from the appropriate snapshot (e.g. latest, or yesterday, or whatever)
--> create an isolated dev system (preferably lxc) using the clone
--> start up the dev system
--> handle the bug report
--> after you're done, delete the clone
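The per-bug-report steps might look roughly like this (all names are made up):

$ zfs clone tank/pg-standby@zfs-auto-snap_daily-2014-07-02 tank/dev/bug-1234
# point a throwaway postgres instance (or an lxc container) at the clone's
# mountpoint, reproduce the bug, then throw it away:
$ zfs destroy tank/dev/bug-1234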
Turning the question on its ear: is there *any* file system that would only
write the changed blocks in a similar file? If so, we could export a zvol
and format it with that file system...
zfs only writes changed blocks. The problem in your case is that a db
dump will write new blocks.

I'm not sure I understand your zvol idea, though. While using another
fs on top of zfs might be useful in some cases (e.g. gluster on xfs on
a zvol), most of the performance overhead in zfs also exists for zvols.
--
Fajar

John McEntee
2014-07-03 11:30:47 UTC
If it has to be database dumps and they are going to a dedicated backup server, you could make the receiving zfs file system do the compression and deduplication. Beware: if the box does not have enough RAM, performance drops to almost nothing. As a rule of thumb you need about 10 GB of RAM for every 1 TB of used space, plus an L2ARC on SSD so you still have some performance if you run out of RAM. I am willing to bet you will actually get a performance increase, as very little of the database dump's data would actually make it to disk.
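On the backup box that might look roughly like this (pool and dataset names are
made up; lz4 assumes the pool supports it, otherwise use compression=gzip or =on):

$ zfs create -o compression=lz4 -o dedup=on backuppool/dbdumps
$ zfs get compressratio backuppool/dbdumps
$ zpool list -o name,dedupratio backuppool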


John

Gordan Bobic
2014-07-03 11:37:01 UTC
On Thu, Jul 3, 2014 at 12:30 PM, John McEntee wrote:
Post by John McEntee
If it has to be database dumps and it is going to a dedicated backup
server. You could make the receiving zfs file system do the compression and
deduplication.
I am not 100% sure, but I seem to recall that you cannot change
compression/dedupe options at the point of zfs receive. Is that no longer
the case?

Lists
2014-07-03 18:15:25 UTC
Post by John McEntee
If it has to be database dumps and they are going to a dedicated backup server, you could make the receiving zfs file system do the compression and deduplication. Beware: if the box does not have enough RAM, performance drops to almost nothing. As a rule of thumb you need about 10 GB of RAM for every 1 TB of used space, plus an L2ARC on SSD so you still have some performance if you run out of RAM. I am willing to bet you will actually get a performance increase, as very little of the database dump's data would actually make it to disk.
For context, our layout is below. What I'm talking about doing is the
possibility of doing block-level deduplication in some fashion at the
"Production Location BACKUP" server. FILE1-3 are simple file stores
used by our application, which writes to all of them in parallel for
real-time redundancy.

### Production Location
# PostgreSQL servers replicate with streaming async. Each server hosts
# 50 to 100 client databases.
DBSERVER1 -> DBSERVER1a
DBSERVER2 -> DBSERVER2a
DBSERVER3 -> DBSERVER3a
DBSERVER4 -> DBSERVER4a
DBSERVER5 -> DBSERVER5a
FILE1 (zfs 0.6.2)
- makes snapshots nightly that are replicated to BACKUP, Disaster
Recovery FILE1, and Dev FILE1
FILE2 (ext4/lvm2)
FILE3 (ext4/bare partition)
BACKUP
- pg_dump > zfs pool every hour
- zfs send/receive from FILE1, keeps 1 month of daily snapshots
"just in case".


# Disaster Recovery Location
DBSERVER1 reloads pg_dump files nightly from production PG dumps. We
will soon async replicate.
DBSERVER2 "
DBSERVER3 "
DBSERVER4 "
FILE1 (zfs 0.6.2)
- zfs receive nightly from production's File1 -> pool, creates a
clone of most recent snapshot available for DR's copy
of the application.


# Dev/Archive Backups Location
DBSERVER1 reloads pg_dump files from onsite backups
DBSERVER2 "
FILE1 (zfs 0.6.2)
- replicates zfs pools for files and db snapshots nightly from
production BACKUP server.
- round-robin backup of misc servers via rsync, snapshot locally.
- clones of snapshots are available for developers to use.
FILE2 (zfs 0.6.3)
- Streams everything from FILE1 local server in case FILE1 dies
horribly.
* Having severe performance issues *
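For reference, the nightly FILE1 -> BACKUP replication step is roughly along these
lines (dataset names and snapshot dates are made up):

$ zfs send -I tank/files@2014-07-01 tank/files@2014-07-02 | \
      ssh backupserver zfs receive -F backuppool/files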
Post by John McEntee
Post by Lists
I would agree that PG and filesystem level snapshots would work, but
1) Running PG on top of ZFS directly would have a significant
performance hit. Currently, we're running on EXT4 partitions on
Enterprise SSDs and use gobs and gobs of RAM for maximum performance.
Is this your tested result, or a prediction?
While zfs might introduce its own overhead, the added features might worth it. And sometimes (depending on the workload) adding more ram will help.
It's a prediction based on performance changes on the file servers
mentioned above. When we started using ZFS about 6 months ago, we did so
by adding a 3rd storage pod to our file storage pool. While ZFS does
keep up with the load, it is significantly slower, taking at least 2x as
long to do a local rsync, even though it's on newer/faster hardware. It's
ridiculously faster than rsync for backups, though, so we're very happy! But
this experience has made me hesitant to use ZFS *everywhere*.

Also, I'm seeing some terrible performance issues with the secondary,
offsite ZFS server (FILE2).

Our DB query volume is extreme (about 1000 queries/second at peak) and
we've done *everything we can think of* to eke another bit of performance
out: 128 GB of ECC RAM, Enterprise SSDs, a highly tuned PG config, heavy
use of indexes, continuous query analysis and tuning, etc.
Post by John McEntee
Post by Lists
2) We use the DB dumps in order to feed our dev environment. This lets
/var/lib/pgsql on ZFS) means that we're replicating the whole cluster,
not just a single customer's database - in our app, each customer has
their own database. In our production environment, we have a total of
5 production DB servers that replicate to a hot standby. When treating
a bug report, we load that customer's database into a dev environment
DB server so that we can reproduce the problem exactly. Doing this at
the filesystem level would be highly disruptive for us.
- put the standby on top of zfs. Or, if you have the resource, create another replicated system on zfs functioning as warm standby
- perform regular snapshot (e.g. zfs-auto-snapshot)
--> clone from the appropriate snapshot (e.g. latest, or yesterday, or
--> whatever) create an isolated dev system (prefereably lxc) using the
--> clone startup the dev system handle the bug report after you're
--> done, delete the clone
Each production DB server has a warm, async replicated standby for
manual cutover if a problem arises. DR cluster is not redundant. Each
production DB server hosts perhaps 50 to 100 client databases.

Cloning at the filesystem level would get very messy very quickly, as
we'd have to use many dedicated machines (or at least, a farm of 20-40
VMs) in order to allow our 8 developers to analyze specific issues in
parallel. As we sit now, two dev cluster DB servers can comfortably
handle over 100 DB instances without issue.
Post by John McEntee
Post by Lists
Turning the question on its ear: is there *any* file system that would
only write the changed blocks in a similar file? If so, we could
export a zvol and format it with that file system...
zfs only writes changed blocks. The problem in your case is that a db dump will write new blocks.
I'm not sure I understand your zvol ideas though. While using another fs on top of zfs might be useful in some case (e.g. gluster on xfs on zvol), most performance overhead in zfs also exists in zvol.
When doing the backups, performance is less of an issue. My thought is
that if we could do something like rdiff-backup at the filesystem level
and write only the changes between a file and its successor, we might
see performance improve as io/iops overhead plummets.

It may just be a pipe dream, though. I might just try hacking something
together actually using rdiff-backup and see what the result is.
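A first stab at that might look something like this (paths are made up):

$ rdiff-backup /path/to/backups/ /zpool/dbbackups/rdiff/
$ rdiff-backup --list-increments /zpool/dbbackups/rdiff/
# pull back the dump as it looked 5 hours ago:
$ rdiff-backup -r 5h /zpool/dbbackups/rdiff/$database.pgsql /tmp/$database.5h.pgsql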

-Ben

Fajar A. Nugraha
2014-07-03 18:37:55 UTC
Post by Fajar A. Nugraha
- put the standby on top of zfs. Or, if you have the resource, create
another replicated system on zfs functioning as warm standby
- perform regular snapshot (e.g. zfs-auto-snapshot)
--> clone from the appropriate snapshot (e.g. latest, or yesterday, or
--> whatever) create an isolated dev system (prefereably lxc) using the
--> clone startup the dev system handle the bug report after you're
--> done, delete the clone
Post by Lists
Each production DB server has a warm, async replicated standby for manual
cutover if a problem arises. DR cluster is not redundant. Each production
DB server hosts perhaps 50 to 100 client databases.
Cloning at the filesystem level would get very messy very quickly, as we'd
have to use many dedicated machines (or at least, a farm of 20-40 VMs) in
order to allow our 8 developers to analyze specific issues in parallel. As
we sit now, two dev cluster DB servers can comfortably handle over 100 DB
instances without issue.
I don't see why you'd need 20 machines/VMs when your existing "two dev
cluster DB servers can comfortably handle over 100 DB instances without
issue". From my experience with mysql and oracle, overall database size
doesn't really contribute much to resource needs (e.g. memory, cpu) as much
as the size of data being accessed and what you do with it. So if your
existing server can handle whatever-you-throw-at-them when loaded with only
one db, it should also be able to handle handle whatever-you-throw-at-them
when running from the clone, as long as you're only accessing the same db
(the one that you would usually load manually).

Also, lxc is VERY efficient virtualization. If you can run 100 db instances
on two servers (where each differs by database path and listening port),
then you can run 100 lxc containers on the same server (where each
container would have a different IP and run a db instance with the same
settings).

Note that you don't HAVE to use lxc if you're not familiar with it. It's
simply easier, since you can run each instance in the container, where each
instance would use standard settings (e.g. standard TCP listening port). If
you're more used to using a custom path and custom port for each instance,
then you can use that as well. You pretty much replace the "set up a new db
instance and load from backup" step with "clone from the latest send/receive
snapshot and adjust the settings (e.g. which TCP port to listen on)".
--
Fajar

Lists
2014-07-03 19:38:01 UTC
Post by Fajar A. Nugraha
I don't see why you'd need 20 machines/VMs when your existing "two dev
cluster DB servers can comfortably handle over 100 DB instances
without issue". From my experience with mysql and oracle, overall
database size doesn't really contribute much to resource needs (e.g.
memory, cpu) as much as the size of data being accessed and what you
do with it. So if your existing server can handle
whatever-you-throw-at-them when loaded with only one db, it should
also be able to handle handle whatever-you-throw-at-them when running
from the clone, as long as you're only accessing the same db (the one
that you would usually load manually).
Here's the problem with filesystem-level ZFS for PG replication: you get
all or nothing. If you rsync/snapshot /var/lib/pgsql and load it, you
get the state of all the DBs in the PG cluster on that server as of the
time of rsync. They all go together, so if you use the zfs clone method,
the server (or instance) gets a complete reset of all DBs for that
server. Since each dev has anywhere from 5 to 20 client DBs in testing
at any given moment, that would mean upwards of 8 * 20 VM/postgresql
instances, each having a full copy of the Production DB server running
at that time. I realize that cloning removes most of the overhead in
doing so, but it's not a route that initially excites me.

I'll give this some thought, however.

Doing the db dump route, we can load a single client's DB. We have a
script that drops the DBs in dev in a "least recently used" fashion.

Fajar A. Nugraha
2014-07-03 21:59:19 UTC
Since each dev has anywhere from 5 to 20 client DBs in testing at any
given moment,
Aaah, I see. So previously you were using one instance to load 5-20 dbs?
Makes sense.
that would mean upwards of 8 * 20 VM/postgresql instances, each having a
full copy of the Production DB server running at that time. I realize that
cloning removes most of the overhead in doing so, but it's not a route that
initially excites me.
A "hybrid" method would be where you have one db (either dev or backup)
running replication, AND with zfs auto snapshot on (so you can
go-back-in-time if you have to). When a dev needs to troubleshoot, he dumps
a client db from that replica, and load it in his own db instance. It
should work since for most troubleshooting purpose you'd probably only need
the latest copy of the data anyway.
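In shell terms, something like this (host and db names are made up):

$ createdb -h dev-host $client_db
$ pg_dump -h replica-host $client_db | psql -h dev-host $client_db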

The other ways would not provide much space savings. Even dedupe could fail
miserably when there is a shift in the resulting dump file caused by
records inserted into a table.

Good luck with whatever you choose.
--
Fajar

John McEntee
2014-07-04 15:10:25 UTC
Post by Lists
FILE2 (zfs 0.6.3)
- Streams everything from FILE1 local server in case FILE1 dies horribly.
* Having severe performance issues *
Our DB query volume is extreme (about 1000 queries/second at peak) and we've done *everything we can think of* to eke another bit of performance out. 128 GB of ECC RAM, Enterprise SSDs, highly tuned PG config, heavy use of indexes, continuous query analysis and tuning, etc.
What is the uptime of FILE2, the box with the severe performance issues and 128 GB of RAM? I have a box with 192 GB of RAM with severe performance issues as well; a reboot has solved the problems for the moment. My googling indicated a performance problem when releasing L2ARC (I think) on systems with 128 GB of RAM or higher. I was going to reduce my system to 96 GB of RAM, but I have not tried it yet to confirm or deny this theory. The reboot has worked for the moment, but it is a lightly loaded system.

John

Andrew Reid
2014-07-03 19:10:04 UTC
Post by Lists
If you save a file to a ZFS file system, and then overwrite the entire file with very similar content, does
ZFS reallocate a new set of blocks for the whole file, or is it "smart enough" to write only the changes in the
file being rewritten over?
If you have not already tried it, you could/should do a very quick experiment.

DB dump to file
rsync "inplace" file to a file on zfs dataset
snapshot the zfs dataset

repeat hourly

There is a chance that the snapshots will only contain the deltas and they will be small. There are many things which might cause this not to be the case, but it is worth trying at least. I did this in an analogous situation (nothing to do with databases, but involving a series of large files with very small deltas compared to the size of the file) and was very happy with the results.

Certainly worth the try, because if it works you have independent access to any and every iteration of the dump file.
The worst that can happen is the snapshots blow up to approach the size of each iteration, which is where you are now.

Also anything that you can do to break up the dump into pieces which are likely to stay static will improve your chances.
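For example (paths are made up): pg_dump's directory format (PostgreSQL 9.1+) writes
one file per table, so tables that haven't changed are far more likely to produce
unchanged files, and you can check what each hourly snapshot actually costs:

$ pg_dump -U $dbuser -h $dbhost -Fd -f /zpool/dbbackups/$database.dir $database
$ zfs list -t snapshot -o name,used -r zpool/dbbackups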


Andrew
Post by Lists
We do Postgresql database dumps the usual way with pg_dump. The dumps are performed via cron scripts that run
hourly. We currently keep backups of these for 24 hours, then we keep the 10 PM backup in an archival form pretty
much forever. Nightly a script makes a tar of the hourly onto external media, and then rsync the files offsite.
Lists
2014-07-03 23:57:38 UTC
Post by Andrew Reid
If you have not already tried it, you could/should do a very quick experiment.
dB dump to file
rsync "inplace" file to a file on zfs dataset
snapshot the zfs dataset
repeat hourly
There is a chance that the snapshots will only contain the deltas and they will be small. There are many things which might cause this not to be the case, but it is worth trying at least. I did this in an analogous situation (nothing to do with databases, but involving a series of large files with very small deltas (compared to the size of the file) and was very happy with the results.
Certainly worth the try, because if it works you have independent access to any and every iteration of the dump file.
Worse that can happen is the snapshots blow up to approach the size of of each iteration, which is where you are now.
Also anything that you can do to break up the dump into pieces which are likely to stay static will improve your chances.
Gave this a shot as the most promising method to do what I've been
seeking. Sadly, even --inplace rewrote the entire file so 36 dumps of a
1.3 GB file caused a snapshot to be 1.3 GB in size.

Thanks for the idea.

-Ben
