Rsyncable gzip

This article was first written in February 2005 for the BeezNest technical
website (http://glasnost.beeznest.org/articles/206).

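To build a compressed tar archive that rsync can update efficiently, pass the --rsyncable option to gzip, here through the GZIP environment variable:

GZIP="--rsyncable" tar zcvf toto.tar.gz /toto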

Why do you need this special option?

Because if you compress your files before synchronising them with rsync, a very small change in one original file may force rsync to re-transmit the whole compressed tar.gz file, instead of just the changed portion.

The basic reason is that rsync works at the byte level: very roughly, it compares the old copy of the file with the latest source, and transmits only the parts that differ, in order to update the old copy and make it identical to the new one. rsync does these comparisons in a smart way, so that in most cases only a tiny portion of the file actually needs to be transmitted.
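To make the comparison idea concrete, here is a minimal Python sketch of block matching. It is not rsync's actual algorithm: the block size, the use of SHA-256 and the name `literal_bytes` are assumptions made for illustration, and real rsync uses a cheap rolling checksum so that it does not have to hash every window in full.

```python
import hashlib
import random

BLOCK = 64  # illustrative block size, chosen only for this sketch


def literal_bytes(old, new, block=BLOCK):
    """Count the bytes of `new` that cannot be copied from a block of `old`."""
    # Index every full block of the old copy by its hash.
    known = {
        hashlib.sha256(old[i:i + block]).digest()
        for i in range(0, len(old) - block + 1, block)
    }
    literals = 0
    i = 0
    while i < len(new):
        window = new[i:i + block]
        if len(window) == block and hashlib.sha256(window).digest() in known:
            i += block        # this block already exists on the other side
        else:
            literals += 1     # this byte has to be sent as-is
            i += 1
    return literals


rng = random.Random(0)
old = bytes(rng.getrandbits(8) for _ in range(4096))
new = old[:10] + b"X" + old[10:]   # one byte inserted near the start

print(literal_bytes(old, old), literal_bytes(old, new))
```

With these inputs only about one block's worth of bytes has to be sent as literal data: everything after the insertion is found again in the old copy, even though it has moved by one byte.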

Unfortunately, file compression algorithms that use an adaptive compression method (as most do) defeat the rsync logic and can cause the whole file to be retransmitted, even if only one byte has changed.

Why is that so?

An adaptive compression method analyses the bytes it has already processed to determine how best to compress the following bytes of the file. For example, suppose the compression program starts at byte 0 with a certain compression method. After 1000 bytes have been compressed, the program will calculate a new compression method, based on what it found in bytes 0-999. It will then insert a new compression table into the file, and use this table to compress the next 1000 bytes. Then it recalculates its compression table based on bytes 0-1999, does the same again, and so on. This means that a change of one byte in bytes 0-999 can potentially change the compression method for the rest of the file, so that the rest of the output bytes will be totally different. And because rsync compares the files byte by byte, it will not find any similar block of bytes between the old and the new file, and will thus be forced to resend the whole new compressed file.
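If you want to see how far a one-byte change spreads in practice, the following Python sketch compresses two nearly identical inputs with zlib (the same deflate method that gzip uses) and measures how many trailing bytes of the two outputs are still identical. The sample text and the helper names `raw_deflate` and `common_suffix_len` are made up for this illustration. The figures depend on the input and on the zlib build, so none are quoted here.

```python
import zlib


def raw_deflate(data):
    """Compress with deflate, without header or checksum (wbits=-15)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


def common_suffix_len(a, b):
    """Number of identical bytes at the end of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


# Hypothetical sample data: about 150 kB of repetitive text.
sample = b"".join(b"line %d of some sample text\n" % i for i in range(5000))
# The same data with a single byte altered at offset 23.
changed = sample[:23] + bytes([sample[23] ^ 1]) + sample[24:]

out_a = raw_deflate(sample)
out_b = raw_deflate(changed)
print("compressed sizes:", len(out_a), len(out_b))
print("identical trailing bytes:", common_suffix_len(out_a, out_b))
```

The raw deflate format is used so that the checksum at the end of a zlib or gzip stream, which always changes with the input, does not hide the effect being measured.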

The --rsyncable option above fixes this problem. With this option, gzip will regularly “reset” its compression algorithm to what it was at the beginning of the file. So if, for example, there was a change at byte 23, this change will only affect the output up to, say, byte #9999 at most. Then gzip will restart ‘at zero’, and the rest of the compressed output will be the same as it was without the changed byte 23. This means that rsync will now be able to re-synchronise between the old and the new compressed file, and can then avoid sending the portions of the file that were unmodified.
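The effect of such resets can be imitated in a few lines of Python. This is only a sketch of the idea and not how gzip implements --rsyncable: here every chunk of 10,000 input bytes is simply compressed by a fresh zlib compressor, and both the fixed chunk size and the name `compress_with_resets` are assumptions made for the example.

```python
import zlib

CHUNK = 10_000  # hypothetical reset interval, echoing "byte #9999" above


def compress_with_resets(data, chunk=CHUNK):
    """Compress each chunk with a fresh raw-deflate compressor.

    Returns one compressed piece per chunk. Joined together, the pieces
    form one deflate stream (without a final-block marker).
    """
    pieces = []
    for start in range(0, len(data), chunk):
        comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15: raw deflate
        piece = comp.compress(data[start:start + chunk])
        piece += comp.flush(zlib.Z_FULL_FLUSH)  # byte-align, keep stream open
        pieces.append(piece)
    return pieces


# Hypothetical sample data, and the same data with byte 23 altered.
sample = b"".join(b"line %d of some sample text\n" % i for i in range(5000))
changed = sample[:23] + bytes([sample[23] ^ 1]) + sample[24:]

a = compress_with_resets(sample)
b = compress_with_resets(changed)

print("pieces:", len(a))
print("first piece identical:", a[0] == b[0])
print("all later pieces identical:", a[1:] == b[1:])

# The joined pieces still decompress to the original data.
restored = zlib.decompressobj(-15).decompress(b"".join(b))
print("round trip ok:", restored == changed)
```

Because each piece depends only on its own chunk of input, the change at byte 23 can only alter the first piece. Fixed chunk boundaries like these would all shift if a byte were inserted or deleted, so a real implementation needs a way of choosing reset points that survives such shifts.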

Now, for the example above, suppose “/toto” is a directory with plenty of small files for a total of 50 MB, thus the uncompressed tar file would be about 50 MB. By compressing it with gzip, we bring this down to 15 MB in the tar.gz file. Now we ‘rsync’ this file with a remote system.

If nothing has changed since yesterday in the /toto directory, the tar.gz file will be the same as yesterday, rsync will detect this and the file will not be transmitted.

On the other hand, if one single small file at the beginning of the ‘tar’ has changed, then without the --rsyncable option most of the tar.gz file will be different, and rsync will have to transmit almost 15 MB to the remote rsync target system. In that case, it would have been better not to compress the tar file at all!

With the --rsyncable option, it is possible that only 1000 bytes would be different in the tar.gz file, so only 1000 bytes would be transmitted by rsync, for the same end-result.


  1. November 26, 2010 at 6:46 am | #1

    thanks for explaining this. Very easy for me to understand now. i’ll be using this GZIP feature for my transfers from now on. Cheers,

    Felipe

  2. March 28, 2011 at 1:22 am | #2

    I recently stumbled on that in the man page for gzip, and I also recently started rsyncing a few hundred gzips to a host, so it was quite useful. Glad to have an explanation.
