
Why DVCS is better with binaries than you might think

Almost every discussion about distributed version control systems (DVCS) on the Internet includes at least one post along the lines of “DVCS can’t merge binary files and my project has binary files.” There is a good reason why you might not want to use DVCS for binaries, but contrary to popular belief, DVCS not being able to merge them isn’t it. My purpose here is to try to convince you why, with some exercises you can try yourself at home.

Most of the confusion arises from not understanding the difference between merging binary files and merging branches that contain changed binary files. No version control software can do the former, and the latter DVCS does just as well as any other VCS, if not better.

I can already feel the heat from people preparing their replies. This is one of those things that people will argue about forever even when they’re wrong, because it seems so obvious that they don’t need to bother trying it for themselves. In anticipation of this, I’ve prepared a short demonstration.

Here’s the scenario: Bob is working on updating the documentation on his company’s website for their upcoming software release. Bob branches from the staging branch so he can work on updating the screenshots to move the minimize/maximize/close buttons to the left. It takes Bob longer than the original estimate, because he didn’t realize you could just take new screenshots instead of editing the old ones in GIMP, but he eventually merges his branch back into the staging branch.

At this point we have one branch that says buttons go on the left and one that says buttons go on the right. Obviously a merge conflict, right? Wrong. The merge algorithm knows the screenshots haven’t changed in the staging branch since Bob branched off and does the right thing. Don’t believe me? Try it for yourself. I’ll wait.
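If you’d rather not set up the whole scenario by hand, here is a sketch using git; Bazaar and Mercurial behave the same way. The “screenshot” contents are text stand-ins for a real PNG, which changes nothing here: only one side modified the file, so the merge is resolved at the tree level and the file’s content is never examined.

```shell
# A sketch of Bob's scenario. File names and branch names are illustrative.
set -e
cd "$(mktemp -d)"
git init -q site && cd site
git config user.email bob@example.com && git config user.name Bob
git checkout -qb staging
printf 'PNG stand-in v1' > screenshot.png
echo 'welcome' > index.html
git add . && git commit -qm 'buttons on the right'
git checkout -qb move-buttons                # Bob's feature branch
printf 'PNG stand-in v2' > screenshot.png    # Bob retakes the screenshot
git commit -qam 'buttons on the left'
git checkout -q staging
echo 'welcome v2' > index.html               # unrelated change on staging
git commit -qam 'tweak homepage copy'
git merge -q --no-edit move-buttons          # clean three-way merge, no conflict
git show staging:screenshot.png              # prints the v2 stand-in
```

The merge succeeds because the screenshot changed on only one branch; git takes Bob’s version without ever asking whether the file is “mergeable.”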

Some people get it in their head that this will work for text files, but not binary files, because binary files aren’t “mergeable.” Note that in this scenario, the merge algorithm doesn’t care if the files are mergeable or not.

That’s not to say there aren’t scenarios where mergeability matters, but with binary files you hope you never get into that situation, because no version control can get you out of it. If Alice is changing the same screenshots at the same time as Bob, there’s no way to merge them automatically.
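Here is a sketch of the Alice-and-Bob collision, again using git with text stand-ins. Both branches change the same file, so the merge stops with a conflict, and the only way out is to pick one side wholesale (or redo the edit by hand).

```shell
# Both branches edit the same "binary": a genuine conflict.
set -e
cd "$(mktemp -d)"
git init -q site && cd site
git config user.email dev@example.com && git config user.name Dev
git checkout -qb staging
printf 'original screenshot' > screenshot.png
git add . && git commit -qm 'base'
git checkout -qb alice
printf 'alice version' > screenshot.png
git commit -qam 'alice moves the buttons'
git checkout -q staging
printf 'bob version' > screenshot.png
git commit -qam 'bob moves the buttons'
if git merge -q --no-edit alice; then echo 'merged cleanly'; else echo 'conflict'; fi
git checkout --theirs screenshot.png   # resolve by taking Alice's version wholesale
git add screenshot.png && git commit -qm 'take alice version'
```

No tool can blend two screenshots; resolution means choosing one side, which is exactly what any centralized VCS would make you do too.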

To help out people who don’t like scary solutions like communicating with your coworkers, most centralized version control software lets you place a lock on a file. Because none of the major distributed software has this lock feature yet, people claim that’s because locking is fundamentally impossible with DVCS.

While it’s true locking requires communication with a central lock authority, there’s no need for that to be in the same place as everything else, nor is there a need to be in constant contact with that central authority. If people spent as much time implementing the feature as they do whining about the lack of it, every DVCS implementation would have had locking years ago.
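To show how little machinery is actually required, here is a hypothetical sketch, not a production design: a shared directory (an NFS mount, an ssh-reachable path, whatever the team already has) serves as the central lock authority, and the atomicity of mkdir does the rest. The paths and function names are invented for illustration, and it only handles flat file names.

```shell
# Hypothetical lock protocol over a shared directory. mkdir either creates
# the directory or fails, atomically, so it doubles as the lock authority.
LOCKROOT=${LOCKROOT:-$(mktemp -d)}   # stand-in for the team's shared path

lock() {   # lock <file>  ->  fails if someone already holds it
    mkdir "$LOCKROOT/$1.lock" 2>/dev/null || return 1
    echo "${USER:-someone}" > "$LOCKROOT/$1.lock/owner"
}
unlock() { # unlock <file>
    rm -rf "$LOCKROOT/$1.lock"
}

lock screenshot.png && echo "locked by $(cat "$LOCKROOT/screenshot.png.lock/owner")"
lock screenshot.png || echo "already locked"
unlock screenshot.png
```

Nothing about this needs to live in the same place as the repository, and you only need to reach it at the moment you take or release a lock.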

As I mentioned, there is one good reason why you might not want to adopt DVCS for your binary files. Binary diffs tend to be larger than text diffs, and with DVCS the entire history gets stored on every developer’s computer. However, you shouldn’t assume that every change will grow your repository by 100% of the binary file’s size. In my test of the Bob scenario, the Bazaar repository grew by only about 36%. Nor should you assume that all that history is copied every time you make a new branch: all the major software lets local branches share storage, and although the initial checkout may take a while, after that only the changes are transferred.
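The overhead is easy to measure for yourself. The sketch below commits two versions of a 100 KB stand-in binary to a git repository and compares the repacked sizes; the exact numbers will vary with your data and your VCS, so treat it as a way of gathering your own evidence rather than a benchmark.

```shell
# Measure how much one edited binary actually grows a repository.
set -e
cd "$(mktemp -d)"
git init -q art && cd art
git config user.email a@example.com && git config user.name A
head -c 100000 /dev/urandom > texture.bin    # stand-in screenshot
git add . && git commit -qm 'v1'
git gc -q
before=$(du -sk .git | cut -f1)
# Change a small region of the file, as Bob's button edit would.
printf 'buttons-on-left' | dd of=texture.bin conv=notrunc 2>/dev/null
git commit -qam 'v2'
git gc -q                                    # repack so the versions share deltas
after=$(du -sk .git | cut -f1)
echo "before=${before}K after=${after}K"
```

Because the two versions mostly overlap, the packed second version is stored as a small delta against the first, far below the naive 100% you might expect.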

In conclusion, if you have been avoiding evaluating DVCS because of the binary myth, you might want to give it a second look and actually try it out on your own files. You may still find CVCS to be a better fit, but at least that decision will be based on evidence. On the other hand, I think you have a good chance of being pleasantly surprised.

  1. Some call me Tim
    July 19, 2010 at 1:47 am

    I’ve spent a lot of time thinking about the problem you describe, and I came to the exact same conclusion, though with two big caveats.

    I agree that you need file locks to be safe editing binaries. I guess I was never confused about how to handle the histories, though it turns out that PERFORCE developers WERE confused about that exact point, and did auto-merging wrong on files that are set to have no history (and so they couldn’t do diffs with previous versions). They even told me that they couldn’t fix the bug, though I believe they eventually did, after I explained how. Sigh.

    DVCS systems don’t support locks, but could, or you could handle locks outside of DVCS, exactly as you describe. In practice, most groups probably end up using email as their system outside of DVCS, though for obvious reasons I’d like to see things integrated. What I don’t relish is the idea of having to roll my own solution, though frankly it wouldn’t be hard.

The first big caveat that kills it for me is that, from what I’ve heard, the current DVCS systems don’t support really big repositories well. I tried getting Bazaar to check out a decent-sized SVN repository (using the built-in bridge) recently, only to have it fail spectacularly, for example. That may be bugs in the SVN bridge, though. Come to think of it, I’ve tried a couple of other bzr-svn checkouts, only to have them fail partially as well.

    But more importantly, I read one article (which I can’t track down now, sorry) where someone had tried to create a git, Bazaar, and Mercurial repository on a really big source tree (>20 GB?) and that each just died after sitting for an extremely long time. If that’s wrong I’d love to hear a follow-up; I know bugs in these systems are being fixed all the time.

    Even if that’s out of date, the other caveat is that partial repository checkouts really would be crucial for DVCS systems to be sane to use on really large repositories. I work with people remotely, and if a particular project has gigabytes of history, I’m not going to want to make someone download that entire history just to work on a small corner of the project. My upstream is only about 12 kB/second; that’s about twenty-three hours per gigabyte, assuming you’re using my entire pipe. Also, I’d love to work with my external artists or designers such that they could directly submit to my source control–and yet have them never actually see or download any of the core game source code, which they’d never actually need to see or have access to. And yet I’m going to want the history of the entire project to be coherent (and commits to be atomic across the entire project).

    If any of the DVCS systems out there can do this, sign me up. I’m using Bazaar right now for stuff that’s not leaving my computer, but when dealing with external contractors, everything is happening via email or FTP or equivalent. I could imagine some system made up of mini-repositories, though even then the art history alone could end up in the gigabyte range, which would suck for bringing new artists online. If you have any thoughts on this, I’d love to hear them.

    • July 19, 2010 at 10:42 am

      My Bazaar repo at work is around 8 GB and it does just fine. That first checkout takes forever, though.

      DVCS is ideal for working with external contractors. It’s a shame the large binary history issue is getting in your way. One thing you could look into and see if it works for you is lightweight checkouts. Those basically keep the history on the server, with the obvious implications.

      Other than that, I think you’re stuck with roll-your-own solutions, or with keeping your art in an SVN repository that’s imported into a bzr branch for your source code. I’m not opposed to people using centralized version control when it’s the right tool for the job. I just don’t want them avoiding DVCS for the wrong reasons.

  2. Christopher Jefferson
    July 19, 2010 at 3:51 am

    Last time I checked, you couldn’t add to a git repository any file larger than about 1.8GB from a 32-bit computer, and it wanted to be able to map the whole file into memory.

    Also we produce a 2GBish file every week or so which we check into version control. You don’t need to do that very many times before people’s git repositories grow totally out of control.

    • July 19, 2010 at 9:48 am

      You definitely fall into the “one good reason” I mentioned. My main objective was to dispel the myth that since DVCS is unusable with very large binaries, it must also be unusable with smaller, less frequently changed binaries.

  3. July 19, 2010 at 6:37 am

    What a red herring! DVCS is great, but I’ve never heard one person complain about merging of binaries. The complaint is storing all history on the client side – the compounding nature of binaries in SCM.

    • masklinn
      July 19, 2010 at 7:40 am

      > DVCS is great, but I’ve never heard one person complain about merging of binaries.

      Because the people who might run into that issue (game studios, people dealing with sound or video, etc.) are smart enough to just stay on their centralized workhorse.

    • July 19, 2010 at 9:43 am

      You must hang out on more intelligent forums than I do, then. Here’s just one example.

