The Hype
One of the most common requests we hear is to implement delta sync. Since beginning our implementation, we've found that it doesn't help nearly as much as many expect. People often assume delta sync has a large impact on sync speed, frequently because others in the sync space have advertised the feature heavily. Many have seen demo videos like Dropbox's, where a large image is edited and only the small change needs to be uploaded. The video says that delta sync saved over 80% of the bandwidth.
Claims like these are just plain misleading. As the example below shows, the savings a normal user would see are actually 0%. Given the types of data we see users synchronizing, most people won't see much benefit from delta sync.
What is Delta Sync?
For those unfamiliar with delta sync, it is a technology designed to detect and send only the parts of a file that have changed. If you have a large two-megabyte file and change only two bytes in it, delta sync lets you send just the part that changed rather than re-upload the entire two-million-byte file. It is less useful for small files, where the bookkeeping and header information for chunk tracking starts to eat into any benefit, but at first glance it seems to offer a huge benefit for large files.
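To make the idea concrete, here is a minimal Python sketch of block-level change detection. It is an illustration only, not how any particular product works: it compares fixed-size blocks at the same offsets, which is simpler than the rolling-checksum matching rsync performs, and the names `block_hashes` and `bytes_to_send` and the 4096-byte block size are assumptions made for this example. Note that a block-based scheme resends the whole block containing a change, not just the two edited bytes.

```python
import hashlib


def block_hashes(data: bytes, block_size: int = 4096) -> list:
    """Hash each fixed-size block of data (the last block may be shorter)."""
    return [
        hashlib.sha256(data[start:start + block_size]).digest()
        for start in range(0, len(data), block_size)
    ]


def bytes_to_send(old: bytes, new: bytes, block_size: int = 4096) -> int:
    """Count the bytes in blocks of `new` that differ from the
    block at the same offset in `old` (hypothetical helper)."""
    old_hashes = block_hashes(old, block_size)
    total = 0
    for index, start in enumerate(range(0, len(new), block_size)):
        block = new[start:start + block_size]
        is_new_block = index >= len(old_hashes)
        if is_new_block or hashlib.sha256(block).digest() != old_hashes[index]:
            total += len(block)
    return total


# A two-million-byte file with two bytes changed in the middle.
original = bytes(2_000_000)
edited = bytearray(original)
edited[1_000_000:1_000_002] = b"\x01\x02"
edited = bytes(edited)

sent = bytes_to_send(original, edited)
print(sent, "of", len(edited), "bytes need to be sent")
```

With these inputs the sketch reports 4096 bytes to send out of 2,000,000: one block rather than the whole file.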
A Small Example
The reason delta sync doesn't help as much with large files is that almost all large file types are stored compressed. Videos, music, digital photos, Photoshop files, PDFs, you name it -- the files you deal with day-to-day are all stored compressed. Unfortunately, compression negates the benefit of delta sync. When a file is saved in a compressed format, it is run through a process that finds redundant data and removes it. Because the encoding of each part of the file depends on what came before it, any change to the file, no matter how minor, typically changes most or all of the stored bytes.
So how do we get these claims of bandwidth savings for large files? With a bit of sleight of hand and some contrived circumstances. To use a concrete example, I'll use the Dropbox demo itself. In it, a picture of a platypus is drawn over with a white X and we see the claim of 80% bandwidth savings due to delta sync. A detail that is somewhat glossed over is that while the file shown before the edit is a JPEG (.JPG - the digital image format used by almost all digital cameras), the file actually edited is a bitmap (.BMP - an uncompressed and uncommonly used format).
I took an almost identical graphic and made a similar edit. Using the rsync tool, which implements the same rsync algorithm Dropbox uses, I measured the bandwidth savings for edits made to the compressed JPG and the uncompressed BMP files. The difference is striking.
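The effect of compression is easy to observe with a short Python sketch. This is an illustration under stated assumptions: `zlib` stands in for whatever compression a real file format applies (JPEG and video codecs work differently), the generated word list is made-up sample data, and `differing_blocks` is a hypothetical helper that compares same-offset 512-byte blocks rather than doing rsync-style matching.

```python
import random
import zlib


def differing_blocks(old: bytes, new: bytes, block_size: int = 512):
    """Return (blocks that differ, total blocks), comparing blocks
    at the same offset in both inputs (hypothetical helper)."""
    total = (max(len(old), len(new)) + block_size - 1) // block_size
    differing = sum(
        old[start:start + block_size] != new[start:start + block_size]
        for start in range(0, total * block_size, block_size)
    )
    return differing, total


# Made-up, compressible sample data: a long run of repeated words.
rng = random.Random(0)
words = ["delta", "sync", "file", "chunk", "upload", "image", "video", "log"]
original = " ".join(rng.choice(words) for _ in range(50_000)).encode()

# Change two bytes near the start of the uncompressed data.
edited = bytearray(original)
edited[100:102] = b"XX"
edited = bytes(edited)

raw_diff = differing_blocks(original, edited)
packed_diff = differing_blocks(zlib.compress(original), zlib.compress(edited))

print("uncompressed: %d of %d blocks differ" % raw_diff)
print("compressed:   %d of %d blocks differ" % packed_diff)
```

On the uncompressed data exactly one block differs. For the compressed data, compare the printed counts yourself; the exact numbers depend on the zlib build, so none are quoted here.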
For the uncompressed bitmap file, a 65K difference was generated for a 476K file, an 86.4% savings -- in line with the demo.
For the compressed JPEG file, rsync found nothing it could reuse in the 76K file, a 0% savings. There were no bytes saved versus uploading the entire file.
The Dropbox demo doesn't lie, but it is quite misleading. The 86.4% savings is nice, but the demo neglects to mention that no normal end user stores images as bitmaps, and that the 65K needed to send just the changes is almost as large as the entire 76K file when the image is stored in a proper file format.
Try this with a music file, a video, a Photoshop file, or any other large file and you'll find that almost none of the large files people commonly use benefit from delta sync.
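The arithmetic behind these figures can be checked with a few lines of Python. This is a sketch using the rounded sizes quoted above; `savings_percent` is a hypothetical helper, and it assumes the 0% JPEG result means the full 76K was transferred. Because the inputs are rounded to whole kilobytes, the bitmap result can differ slightly from the 86.4% measured on the exact byte counts.

```python
def savings_percent(sent_kb: float, file_kb: float) -> float:
    """Bandwidth saved, as a percentage of uploading the whole file."""
    return 100.0 * (1.0 - sent_kb / file_kb)


bmp_savings = savings_percent(65, 476)   # 65K delta for the 476K bitmap
jpeg_savings = savings_percent(76, 76)   # assumed: the full 76K JPEG was sent
delta_vs_jpeg = 100.0 * 65 / 76          # bitmap delta relative to the JPEG

print("BMP savings:  %.1f%%" % bmp_savings)
print("JPEG savings: %.1f%%" % jpeg_savings)
print("BMP delta is %.1f%% the size of the whole JPEG" % delta_vs_jpeg)
```

The last figure is the important one: the "small" bitmap delta is roughly 86% of the size of the complete JPEG.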
What Delta Sync is Good For
While delta sync doesn't help much with most people's day-to-day files, it is incredibly useful when large files are stored uncompressed. The most typical case is log files for system administrators. These gargantuan text files, recording things like web server accesses, often range from hundreds of megabytes to many gigabytes in size. Data is simply appended to the end as additional activity is logged. This is a perfect case for delta sync.
For the normal user, however, use cases like these are the exception rather than the rule. Next time you see delta sync marketed as a way to save gobs of bandwidth, take those claims with a grain of salt.
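The append-only case is simple enough to sketch in a few lines of Python. This is an illustration only: `appended_tail` is a hypothetical helper, the log lines are made-up sample data, and a real tool would compare checksums rather than hold both copies in memory.

```python
from typing import Optional


def appended_tail(old: bytes, new: bytes) -> Optional[bytes]:
    """Return the bytes appended to `old` to produce `new`, or None
    if `new` is not a pure append (hypothetical helper)."""
    if new.startswith(old):
        return new[len(old):]
    return None


# Made-up web server log: one new line is appended after the last sync.
old_log = b"GET /index.html 200\nGET /about.html 200\n"
new_line = b"GET /missing.html 404\n"
new_log = old_log + new_line

tail = appended_tail(old_log, new_log)
print(len(tail), "of", len(new_log), "bytes need to be sent")
```

Only the new line has to travel over the network, however large the existing log has grown.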
How I Tested
If you'd like to try repeating the results for yourself, you can download the files I used here.
I used rsync version 2.6.9 with the following command line (csh syntax) to force a delta sync:
foreach f (*Edited*)
rsync --stats -e 'ssh' $f localhost:tmp/`echo $f | sed 's/_Edited//'`
end