2014-07-02 01:16:00 UTC
If you save a file to a ZFS file system and then overwrite the entire file with very similar content, does ZFS allocate a new set of blocks for the whole file, or is it "smart enough" to write only the parts of the file that actually changed?
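One way to test this empirically is to snapshot a scratch dataset, overwrite a file with identical content, and see how much space the old blocks still pin. A rough sketch (the dataset name zpool/test is just a placeholder):
$ zfs create zpool/test
$ dd if=/dev/urandom of=/zpool/test/file bs=1M count=100
$ zfs snapshot zpool/test@before
$ cp /zpool/test/file /tmp/copy && cp /tmp/copy /zpool/test/file   # rewrite the same bytes
$ zfs list -t snapshot -o name,used zpool/test@before
If USED on the snapshot stays near zero after the rewrite, the blocks were shared; if it grows to roughly the file size, ZFS rewrote everything.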
I'm trying to optimize our database backup scheme using ZFS, and knowing exactly what ZFS is doing might make for a dramatic improvement in space utilization.
We do PostgreSQL database dumps the usual way with pg_dump. The dumps are performed via cron scripts that run hourly. We currently keep these backups for 24 hours, and we keep the 10 PM backup in archival form pretty much forever. Each night a script tars the hourly dump onto external media and then rsyncs the files offsite.
However, the hourly DB dumps take up a *lot* of space. It's sad, really, because the data is both highly compressible and highly duplicative: there is almost never even a 1% change from one hour to the next. We once experimented with using diff to keep only the changes from one dump to the next, but that was far too processor-intensive to be feasible.
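Since the dumps are this compressible, one cheap thing to try regardless of the snapshot question is dataset compression. A sketch, assuming the backups live on a dataset named zpool/dbbackups and the pool supports the lz4 feature:
$ zfs set compression=lz4 zpool/dbbackups   # affects only newly written blocks
$ zfs get compressratio zpool/dbbackups     # check the achieved ratio after a few dumps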
### what we do now
E.g.:
$ pg_dump -U $dbuser -h $dbhost $database > /path/to/backups/$database.$hour.pgsql
# 10 pm
$ tar -zcf /path/to/archive/$database.$date.pgsql.tgz /path/to/backups/$database.05.pgsql
$ rsync -vaz --delete /path/to/archive/ ***@archiveserver:/path/to/archive/
### end what we do now
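For reference, the hourly step as a complete script might look like the following; every value here is a placeholder, only the shape matters:
#!/bin/sh
# Hypothetical hourly dump script; dbuser, dbhost, and database are placeholders.
dbuser=postgres
dbhost=db.example.com
database=mydb
hour=$(date +%H)
pg_dump -U "$dbuser" -h "$dbhost" "$database" > "/path/to/backups/$database.$hour.pgsql"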
Consider the following scenario. If ZFS is smart enough to note block-level changes within a single file being overwritten, then this could be extremely space-efficient, since only the changes from one hour to the next would be allocated and written by ZFS. On this view, ZFS would essentially be doing a diff and writing only the changes into each new snapshot.
### scenario
$ pg_dump -U $dbuser -h $dbhost $database > /zpool/dbbackups/$database.pgsql
$ zfs snapshot zpool/dbbackups@$hour
# ... 1 hour later ...
$ pg_dump -U $dbuser -h $dbhost $database > /zpool/dbbackups/$database.pgsql
$ zfs snapshot zpool/dbbackups@$hour
# ... repeat ...
### end scenario
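Whatever the answer, the accounting would be easy to check once a few snapshots exist, since USED on a snapshot is the space only that snapshot pins:
$ zfs list -t snapshot -r -o name,used,refer zpool/dbbackups
$ zfs get usedbysnapshots zpool/dbbackups   # total space consumed by all snapshots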
The money question, I guess, is: in this scenario, would ZFS store 24 copies of the database dump in its snapshots, or 24 sets of changes to a single database dump file? Note that the file name within the file system stays the same; the ZFS snapshots would provide the ability to see which hour's backup is available.
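On that last point, each hour's dump would stay readable through the hidden .zfs directory, so a restore needs nothing beyond cp. For example (the snapshot name 14 assumes the $hour naming from the scenario above):
$ ls /zpool/dbbackups/.zfs/snapshot/
$ cp /zpool/dbbackups/.zfs/snapshot/14/$database.pgsql /tmp/restore.pgsql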
Thanks,
Ben