I’ve been accused before of falsely attributing to distributed version control systems (DVCS) what is really a result of a solid branching model. A certain experiment at work illustrated my point quite well, and I thought it was of general enough interest to share.
When we get close to release, we make what is called a stable label build once a week or so, which is what gets distributed to the testers. Since a lot of destabilizing development is also going on at this time, our process had been to hold off on check ins for a bit unless they are bug fixes for the stable label.
Sometimes this process can take a few days, which causes problems of its own, so someone came up with the idea of an integration branch where developers do all their work, and once a week it would be branched off into a stable label branch, where only bug fixes would be merged over from the integration branch. This is very similar to a master/develop model, which many organizations use quite successfully with git.
Here’s what the integration/stable label branching looks like:
The stable label gets branched off from B. C is a new feature that still has some bugs. D is a bug fix that we want to merge in to the stable label. Spot the problem yet? D is dependent on C, meaning it can’t be merged cleanly without some manual finesse.
They broke the cardinal rule of branching:
Remove commits when you create a branch, not when you merge it.
In this case, D branched from C (although it’s not obvious because it’s the only one), but in order to follow the cardinal rule and remove C upon creation, it needed to be branched from B, like this:
Now the merge into the stable label branch is no more difficult than a normal check in. No need to manually remove the dependency on C because you never brought it in. If you come from the centralized world, however, you might be thinking, “That’s crazy! Branching from B means every single developer would require their own individual branch every time they check in!” Exactly. The entire premise and strength of DVCS is that branches are cheap and unobtrusive.
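To make the dependency concrete, here is a minimal sketch that models the two histories as a parent map and asks which commits a merge of D would bring into the stable label. The commit A, the function names, and the dictionary layout are assumptions made for illustration; real tools track far more than this.

```python
def ancestors(graph, tip):
    """Return the set containing tip and every commit reachable from it."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(graph[commit])
    return seen


def brought_in_by_merge(graph, target_tip, source_tip):
    """Commits that merging source_tip would add to the branch at target_tip."""
    return ancestors(graph, source_tip) - ancestors(graph, target_tip)


# Each commit maps to its parents. A is a hypothetical starting commit.
# Integration branch as described: D sits on top of C, stable label cut at B.
fix_on_top_of_feature = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"]}

# Following the rule: D is branched from B, so C is never in its history.
fix_branched_from_b = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]}

print(sorted(brought_in_by_merge(fix_on_top_of_feature, "B", "D")))  # ['C', 'D']
print(sorted(brought_in_by_merge(fix_branched_from_b, "B", "D")))    # ['D']
```

In the first history the merge has to carry C along or be untangled by hand; in the second, D arrives alone.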
To be fair, we could probably accomplish what we want with centralized version control with something like this:
If developers need to do a bug fix, they check out and into the bug fix branch, then merge their change into the features branch. The weekly stable label build is taken from B, after which a new bug fix branch is created from F. This doesn’t violate the cardinal rule, because the features branch is never promoted to the bug fix branch until you are ready to take all the new features.
There are a few weaknesses with that approach compared to DVCS, however:
- Developers have to frequently switch between two branches, something that our CVCS at least doesn’t support very efficiently without maintaining two separate working trees.
- It depends on the developer to be diligent about choosing the right branch and doing the merges, rather than the person responsible for stabilizing the build being able to merge in the fixes he wants after the fact.
- If a bug fix is inadvertently worked on in the features branch, there is no easy way to move it.
As you can see, DVCS has a distinct advantage in this particular case study.
As a software engineer, I’ve certainly used my share of version control software (VCS) for my code. However, like most of us, I occasionally have to put on my tester hat, when I’m given a bug report and have to narrow it down before I know what part of the code to look at. I found myself using VCS for the first time ever in my tester role last Friday, and it was so useful I can’t believe I never thought of it before.
The bug report I received said that around 3 out of 4 times when starting from factory default and using a certain configuration, a specific failure occurred. That’s the kind of bug report I like: easy to repeat. The problem was that the configuration contained 200 lines, only one of which was likely to be the problem, with no easy way of knowing which one. Additionally, it might cause a problem only in combination with previous configuration items. The failure, although easy to test for, had many possible causes, half of which were contained in a black box we are interfacing with.
We’ve all been there. You get to a point where you think you’re close to the answer, and someone says, “Wait, when it worked before did we do x first or y first?” You’ve manipulated 200 lines of config in every which way imaginable, and no one can remember your previous results without going back and repeating previous tests. A perfect use case for version control software, so I gave it a try.
I used one file containing my config, and three associated log files with test results, all committed to version control at appropriate points with useful commit messages.
The difference was startling compared to previous similar tasks. Whenever I hit a dead end, I could back up to a known position and proceed with confidence that I wouldn’t forget something. Knowing which part of the log was associated with which test was a snap. I only had to think about the logical next step instead of everywhere I’d been. Since I had to keep less in my head at once, more brain power was devoted to debugging than to record-keeping, and I got the problem narrowed down with much less mental fatigue. When I go back to work tomorrow morning, it will be easy to retrace my steps if necessary, even though the weekend knocked it out of my mind.
So, even if you have never written a line of code in your life, it might be very worthwhile to learn version control software. Something like bazaar is fairly easy to learn, can be set up in any directory, without any server required, and removed as easily as deleting a folder when no longer needed. If you’ve been looking to take your testing quality to the next level, I highly recommend giving it a try. I know I will never look back.
I keep hearing Ubuntu described as merely a noob’s distro lately. However, Ubuntu has around 50% of the Linux desktop market share, give or take, but Linux as a whole has only gained a tenth of a percent or so since Ubuntu’s introduction. So either noobs adopted Ubuntu in such numbers that half of Linux veterans switched to Windows in protest, or there are quite a few veterans out there running Ubuntu, but who apparently don’t think it’s cool to admit it.
Well, it’s about time people either come clean or switch already. I’ll start the ball rolling. My name is Karl (Hi, Karl), and I’m a Linux veteran who runs Ubuntu. I switched from Windows 98 to Red Hat, then Mandrake, Suse, Linuxfromscratch, a customized Knoppix for a year when my laptop hard drive crashed and I couldn’t afford to replace it, then Gentoo for about 5 years, and have been running Ubuntu exclusively since Jaunty. I’ve maintained a custom set of conflicting kernel patches, I grep the source before asking on forums, have contributed patches and documentation for various projects here and there, have gone weeks at a time without any GUI at all, and once cross-compiled a bare bones installation for a 486 I had lying around, just to see if it could still be done (it could, and was quite usable without X).
I tell you this to put my Linux experience in context, and hopefully establish my credentials as a veteran. According to the buzz, Ubuntu is the last distro I should be comfortable running, but here I am. So why did I switch, and why did I stay?
When my wife ordered a Dell netbook preloaded with Ubuntu, I decided to create an Ubuntu partition on my laptop so I could advise her on “the Ubuntu way.” Right away I was impressed that my wireless interface “just worked,” and I discovered a number of other bits of UI polish that I didn’t know existed. Additionally, all the binary package dependency issues which pushed me to LFS then Gentoo in the first place were no longer a problem.
I also discovered that my wife wasn’t needing my advice. She bought an external optical drive and was watching DVDs without asking me more than what software she should use. The next time I was facing a major Gnome upgrade on my Gentoo desktop, I installed Ubuntu instead and never looked back.
Does Ubuntu get in the way of the “advanced” things I want to do? Not at all. I have cron jobs set up to do automatic backups. I get an email when one of the disks in my RAID is starting to look iffy. I have intrusion detection with rules updated daily from multiple sources. I have a rather intricate set of firewall rules that logs only the important events. I recently set up my desktop as a full-fledged wireless router, complete with my own DNS, DHCP, and traffic shaping and prioritization. Ubuntu didn’t provide me with nice graphical widgets to set all that up, but they didn’t stand in my way either, and in many cases their documentation was quite helpful.
Does Ubuntu have its issues? Of course, but nothing I have felt the need to dwell on. When the buttons got moved to the left, I spent 30 seconds on google to fix it and moved on with my life. I’ve had a couple of upgrade issues which were just as quick to fix, then I hopped onto the forums or launchpad for a while to make sure others with the same issue but less experience were taken care of.
The Ubuntu community doesn’t seem to have a “divide” between noobs and veterans. For example, people like me who dealt with wifi on Linux before ndiswrapper even existed, but who are just learning hostapd, recognize the contributions of someone who just started using Linux a year ago, but has been working out the kinks in their hostapd setup the whole time. It is very easy to contribute at any level.
Note that I’m not disparaging any other distributions in order to build up Ubuntu. I don’t have to. Ubuntu stands on its own merits, and if something else fits your style better, more power to you. I can guarantee I won’t stick with Ubuntu forever, but I also guarantee it’s not a distro only a noob can love.
One reason I always try to be nice to newbies is that there are two different kinds: people new to everything, and very experienced people who happen to be new to one area. Everyone was the first kind at one point, and could find themselves being the second kind at any time.
This last weekend I found myself in the second category. My wireless router died on me again. I seem to go through about one per year (I suspect inadequate heat dissipation), and was sick of replacing them, so I decided to incorporate its functions into my desktop computer.
I’m about as qualified as you can get for undertaking such a task. I’ve been using Linux extensively since the late ’90s, and write embedded software for network equipment for a living. I knew all the pieces needed in generic terms; it was just a matter of finding, installing, and configuring the specific implementations on Ubuntu.
First step: purchase new hardware. I needed a second gigabit ethernet NIC for the wired side of the network, and an 802.11g interface for the wireless side. Complicating matters, since my wife depends so heavily on wireless for her netbook, Nook, and Roku, I wanted to purchase the parts from my local Best Buy instead of my usual online sources so it could be up and running faster.
For the gigabit ethernet NIC, I didn’t even bother looking up hardware compatibility. There may be some incompatible ones out there, but I have never seen one. The 802.11g interface is more complicated. There is wide support on the client side, but I knew using it as an access point has specific driver requirements.
I did some research on linuxwireless.org, cross referenced it with the inventory on Best Buy’s website, and came up with one possibility. Only one? That threw up a red flag, but Best Buy isn’t exactly known for their stellar Linux support, so I let it go.
I picked it up the next day after work, plugged it in, and it turned out to be v3 of that product, which used a different chipset than I was expecting. I tried a command I had seen on a few HOWTOs anyway: iwconfig wlan2 mode master. Didn’t work. Made sure I was using the best driver, and did some digging to see if there was a development version I could use, and eventually gave up.
Installing the wired side of the network with DNS, DHCP, NAT, and firewall went smoothly. At least I knew those pieces would be working when I got the wireless up.
The next day, I returned the wireless interface to Best Buy, did some more research to make absolutely sure I was getting a compatible card, and ordered one online overnight delivery. Plugged it in the next day, and the “mode master” command still didn’t work.
Now I knew something was up. After some considerable google-fu, I discovered there is an easy new userspace way of handling all the access point management, but the easiest-to-find documentation all described the old, more limited way. In other words, the “mode master” command doesn’t work on purpose now. Aside from a brief issue where hostapd looked like it started but really hadn’t, because of a setting in /etc/default/hostapd, I had it up and running fairly quickly.
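The post does not say which setting was responsible. One common culprit on Debian and Ubuntu systems is a DAEMON_CONF line in /etc/default/hostapd that is commented out or empty, so the service has no configuration file to load. The sketch below assumes that was the case here and simply checks for it; the function name and sample strings are made up for illustration.

```python
import shlex


def daemon_conf(defaults_text):
    """Return the DAEMON_CONF path from /etc/default/hostapd-style text, or None."""
    for raw_line in defaults_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments set nothing
        name, sep, value = line.partition("=")
        if sep and name.strip() == "DAEMON_CONF":
            parts = shlex.split(value)  # strips shell-style quotes
            return parts[0] if parts and parts[0] else None
    return None


# Assumed contents: a commented-out default versus an edited file.
as_shipped = '#DAEMON_CONF=""\n'
after_fix = 'DAEMON_CONF="/etc/hostapd/hostapd.conf"\n'

print(daemon_conf(as_shipped))  # None
print(daemon_conf(after_fix))   # /etc/hostapd/hostapd.conf
```

To check a real machine, pass in the text read from /etc/default/hostapd; a result of None would be consistent with a daemon that appears to start but never brings the access point up.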
So what’s the underlying software engineering principle here? Outdated documents can sometimes be worse than no documents. Here is this great new feature that makes it easier to do what I want, but because the docs for it are so obscure, it actually made it harder.
I intend to do more about it than rant on my blog. Ubuntu will soon be getting some updated documentation courtesy of one experienced newbie who just figured it out the hard way. I consider it my way of “paying” for all the great software I use for free, and encourage all of you who solve similar problems to do the same.
Why branching is better with DVCS
A reader pointed out that my article on Why DVCS provides better central control focused more on branching models than on the nature of distributed versus centralized. My response is that models that require more than one or two concurrent active branches have never worked very well with CVCS. That doesn’t mean your company doesn’t have dozens of branches in its CVCS repository, just that only one or two are likely to be under active development at any one time, whereas with DVCS active branches are naturally ubiquitous. In this article, I will explore some of the reasons why.
My remarks have in mind a medium to large-sized commercial development team consisting of several dozen developers or more who all share a certain amount of code with each other. Most of it will also apply to open source development, but most of it won’t really apply if your entire team consists of only a handful of developers.
The most obvious difference is in the merge algorithms, and their attendant data structures. It’s true that DVCS has better merge algorithms, for now. There’s nothing inherent about a centralized model that prevents CVCS from using the distributed merge algorithms, so that advantage won’t last for long. However, there are other factors which I believe inhibit the use of ubiquitous branches with CVCS, even if the merge process is just as clean.
As technically-minded people, we sometimes forget the most important component of any version control system: the human element. The choice of decentralized versus centralized has a large impact on human behavior. I’ll cover four of those impacts: shared resources, deciding when to branch, permissions, and spheres of disruption.
Shared Resources
What do humans naturally do when they share a resource, such as, for example, a central version control server? They form committees and seek to come to a consensus on any decision about that resource. The more people who share it, the worse the “lowest common denominator” decision is. Creative ideas take a long time to come to fruition, because everyone has to be convinced. That’s why there are only one or two active branches of development, because that’s what we could get everyone to agree to.
Contrast that committee approach with a small subteam of 5 or so developers working in the same distributed repository. As long as the interface with the rest of the company remains intact, they are free to try out any idea they think will help move things along faster.
Deciding When to Branch
I often hear CVCS proponents say they are using feature branches effectively, because they create them whenever a series of changes is going to be “long” or “disruptive.” Setting aside the issue for a moment that this decision is most likely made by committee, it’s very difficult to pin down definitions of long and disruptive. I spent most of the last week avoiding checking out because of a series of changes going on that were not originally anticipated to be long and disruptive. With DVCS, you create branches even for the short and simple stuff. If it turns out to be long and disruptive, there’s no change in the process.
A lot of companies create a new branch for each release. As some people are finishing bug fixes, others are ramping up on the next project. We don’t want to branch too early, because then all the bug fixes will have to be merged over, but if we branch too late, we delay the start of the next project. What if the guys who finished their product early can branch at a different time than the guys who are frantically fixing the last minute bugs? Distributed repositories make this possible, even natural. Also, when every bug fix is a local branch, it’s much easier to merge it into the branches for two different releases.
Permissions
CVCS lets you set permissions on certain branches, but in practice they are very rarely employed to the granularity they need to be, and tend to be either overly permissive or overly restrictive. DVCS lets you set different permissions on different repositories. In case it isn’t apparent how this is useful, I’ll give you an example that I think is fairly common.
Certain areas of our code are deemed crucial enough that they can only be checked in through a select few gatekeepers. However, the gatekeepers are not always the ones who write the code changes. This last week we had some changes go in that had interdependencies with changes in that crucial code. Trying to get all the pieces checked in through and from different people resulted in staggered uncompilable check ins all week, which is why I was afraid to check out. It would have been much easier if we had a separate repository with open permission on that crucial code to do all that integration work, then get it submitted to a gatekeeper to commit to the official repository in one fell swoop.
Spheres of Disruption
Having more than one repository helps limit the amount of disruption caused by creating and deleting branches. For example, how many active branches do you suppose there are in the Linux kernel development? Dozens or even hundreds probably, but if you go to any given repository, you’ll see only the few that most matter to you. A reluctance to organize and wade through that many branches on a central repository creates a natural tendency to keep the number of branches as low as possible.
When you get down to it, the smallest sphere of disruption is an individual developer. Since I can have my own repository on my desktop, people couldn’t care less about how many branches I keep around for my own purposes. I’ve actually been using a DVCS alongside my company’s official CVCS for that very reason. I make temporary branches all the time to quickly check out that bug a tester just saw, test out someone’s changes I’m reviewing, continue my work while my own check ins are held up for some reason like a code review, or for keeping a compilable baseline around while the central one is broken.
If any of you centralized fans have ever created a branch at work for a common yet useful purpose like that, I would love to hear about it. For some people, even if the policies allowed it, the simple fact that everyone can see those branches would often prevent them from doing so. Nothing beats trying DVCS for yourself to experience the complete psychological freedom of being able to create as many branches as you want for whatever reason you want.
In conclusion, while the technical differences are not permanent, there are a number of social factors that will continue to give DVCS a large advantage in employing more powerful branching models.
Almost every discussion about distributed version control systems (DVCS) on the Internet includes at least one post along the lines of “DVCS can’t merge binary files and my project has binary files.” There is a good reason why you might not want to use DVCS for binaries, but contrary to popular belief, DVCS not being able to merge them isn’t it. My purpose here is to try to convince you why, with some exercises you can try yourself at home.
Most of the confusion arises from not understanding the difference between merging binary files and merging branches containing changed binary files. The former no version control software can do, and the latter DVCS can do just as well as any other VCS, if not better.
I can already feel the heat from people preparing their replies. This is one of those things that people will argue about forever even when they’re wrong, because it seems so obvious that they don’t need to bother trying it for themselves. In anticipation of this, I’ve prepared a short demonstration.
Here’s the scenario: Bob is working on updating the documentation on his company’s website for their upcoming software release. Bob branches from the staging branch so he can work on updating the screenshots to move the minimize/maximize/close buttons to the left. It takes Bob longer than the original estimate, because he didn’t realize you could just take new screenshots instead of editing the old ones in Gimp, but he eventually merges his branch back into the staging branch.
At this point we have one branch that says buttons go on the left and one that says buttons go on the right. Obviously a merge conflict, right? Wrong. The merge algorithm knows the screenshots haven’t changed in the staging branch since Bob branched off and does the right thing. Don’t believe me? Try it for yourself. I’ll wait.
Some people get it in their head that this will work for text files, but not binary files, because binary files aren’t “mergeable.” Note that in this scenario, the merge algorithm doesn’t care if the files are mergeable or not.
That’s not to say there aren’t scenarios where mergeability matters, but with binary files you hope you never get into that situation, because no version control can get you out of it. If Alice is changing the same screenshots at the same time as Bob, there’s no way to merge them automatically.
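The distinction can be sketched as a three-way decision made on whole files. This is a simplified illustration, not the merge code of any particular tool; the function name and the byte strings standing in for screenshots are assumptions. The point is that the content is only ever compared for equality, never opened or merged.

```python
def merge_whole_file(base, ours, theirs):
    """Decide a three-way merge by comparing whole-file contents only.

    base is the file at the branch point, ours is the branch being merged
    into, theirs is the branch being merged. Returns (decision, content).
    """
    if ours == theirs:
        return "same on both sides", ours
    if ours == base:
        return "take theirs", theirs  # only the other branch changed it
    if theirs == base:
        return "keep ours", ours      # only our branch changed it
    return "conflict", None           # both changed it, in different ways


# Hypothetical stand-ins for binary screenshot data.
old_shot = b"\x89PNG buttons on the right"
bobs_shot = b"\x89PNG buttons on the left"
alices_shot = b"\x89PNG new colour scheme"

# Bob's scenario: staging has not touched the screenshot since he branched.
print(merge_whole_file(old_shot, old_shot, bobs_shot)[0])     # take theirs

# Alice changed the same screenshot on staging in the meantime.
print(merge_whole_file(old_shot, alices_shot, bobs_shot)[0])  # conflict
```

Only the last case needs the file to be mergeable, and for binaries that is the case no version control system can resolve automatically.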
To help out people who don’t like scary solutions like communicating with your coworkers, most centralized version control software lets you place a lock on a file. Because none of the major distributed software has this lock feature yet, people claim it’s because it’s fundamentally impossible with DVCS.
While it’s true locking requires communication with a central lock authority, there’s no need for that to be in the same place as everything else, nor is there a need to be in constant contact with that central authority. If people spent as much time implementing the feature as they do whining about the lack of it, every DVCS implementation would have had locking years ago.
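As a rough sketch of that idea, a lock authority can be nothing more than a small registry that lives apart from every repository and is contacted only when a lock is taken or given back. Everything here (the class, its methods, the file path, the user names) is hypothetical; no existing DVCS is claimed to work this way.

```python
class LockAuthority:
    """Hypothetical stand-alone lock registry, separate from any repository."""

    def __init__(self):
        self._holders = {}  # file path -> user currently holding the lock

    def acquire(self, path, user):
        """Grant the lock if the path is free or already held by this user."""
        holder = self._holders.setdefault(path, user)
        return holder == user

    def release(self, path, user):
        """Release the lock, but only if this user actually holds it."""
        if self._holders.get(path) == user:
            del self._holders[path]
            return True
        return False


locks = LockAuthority()
print(locks.acquire("docs/screenshot.png", "bob"))    # True
print(locks.acquire("docs/screenshot.png", "alice"))  # False

# Bob can now edit and commit locally for as long as he likes without
# talking to the authority again. He releases once his branch is merged.
print(locks.release("docs/screenshot.png", "bob"))    # True
print(locks.acquire("docs/screenshot.png", "alice"))  # True
```

The authority is contacted twice per lock, once to acquire and once to release, which is what makes it compatible with working disconnected in between.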
As I mentioned, there is one good reason why you might not want to adopt DVCS for your binary files. Binary diffs tend to be larger than text diffs, and with DVCS all the history gets stored on every developer’s computer. However, you shouldn’t assume that every change will increase your repository size by 100% of the binary file size. In my test for the Bob scenario, it only increased about 36% for Bazaar. You also shouldn’t assume that all that history is being copied every time you make a new branch. All the major software lets you share the diffs between local branches, and although the initial checkout may take a while, after that only the changes are communicated.
In conclusion, if you have been avoiding evaluating DVCS because of the binary myth, you might want to give it a second look and actually try it out on your own files. You may still find CVCS to be a better fit, but at least that decision will be based on evidence. On the other hand, I think you have a good chance of being pleasantly surprised.
I previously discussed how distributed version control systems (DVCS) can help with keeping the tip always compilable. DVCS is also useful in making sure the tip always passes a test suite, or maintains any other standard of quality. It does this by giving more control where control is needed.
When you first hear about DVCS, that statement seems counter-intuitive. How can a decentralized system give more control to the central authorities? The key is that by giving up some control where it wasn’t needed, you gain more control over the important parts, sort of like guarding a prisoner in a 10×10 cell instead of a 10 acre field.
In our company we have product-specific code for a number of embedded products, and a large base of code shared between products. Because developers typically only have the hardware for the product they are working on, someone who makes changes to the shared code can only test it on that one product. As a result, although breaking your own product’s build is quite rare, shared code changes that break other products’ builds are much too common.
So how can we set up our branches to mitigate this problem? The answer lies in examining who we want to have control over each branch. At the same time, we think about what our ideal log would look like.
We want the “official” branch for a product to consist of a series of successfully tested builds. We want to be able to take any revision of that branch at any time with confidence. Obviously, the person best suited to controlling that branch is the lead tester for the product. The log at that level would look something like this:
Here we have the log for a fictional Product A. Notice we only have one person committing here, the lead tester for product A. This responsibility could be rotated among all the testers, and could be enforced by only giving the product A testers write permissions on the branch, or more loosely enforced just by social convention. The important thing to notice is that the test group has more control over the product’s official branch than the typical centralized model, where all developers have commit access.
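The promotion step can be sketched as a simple gate: a development revision is recorded on the official branch only if the test suite passes, and only the tester performs that step. The revision names, the results table, and the function itself are invented for illustration and stand in for whatever test process a team actually runs.

```python
def promote(official_branch, candidate, passes_tests, tester):
    """Record candidate on the official branch only when the tests pass."""
    if not passes_tests(candidate):
        return False  # the official branch is left untouched
    official_branch.append((candidate, tester))
    return True


# Hypothetical data: revision names and test outcomes are made up.
test_results = {"dev-7": True, "dev-8": False, "dev-9": True}
official = []

for revision in ["dev-7", "dev-8", "dev-9"]:
    promote(official, revision, test_results.get, "lead tester")

print(official)  # [('dev-7', 'lead tester'), ('dev-9', 'lead tester')]
```

Every entry on the official branch is a revision that passed, which is what lets anyone take any revision of it with confidence.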
Okay, so where do the developers come in? Developers like to have control, but it looks like you just took a whole bunch of control away from them. For that we expand the log to the next level:
At this level you can clearly see which features made it into each promoted build. Developers for product A have full control over this development branch and can set permissions on it as they see fit. This includes preventing the test group from writing to this branch if desired, because all they need to be able to do is pull. In other words, each group has full control over exactly the areas they need it. A developer’s view in their daily work looks like this:
This shows the changes for Product A as if Product A is the most important product in the world. All the shared changes from Product B are hidden behind the plus sign, which you only click to expand if you want to see the details. A developer on Product B would see a similar view of Product B’s development branch, as if Product B is the most important product in the world.
Here you can also see two possible approaches to receiving changes from the shared code. One is what Amy did in revision 2.2.1. For her change, she knew she needed some changes in shared code from Product B in order to proceed with her work, so she merged them in. The other alternative is an assigned branch manager thinking it’s been a while since we synced up, so he more formally merges the changes in. You can do both if you want.
Notice that either way the developers for product A have full control over when shared changes get pulled into their product’s build. If the shared changes cause a product-specific compile or run time error, Amy simply doesn’t commit them until she has worked with the Product B developers to get it resolved. In the meantime, Alice, Arnold, and the test team are all working from a clean baseline.
Another method we use to improve code quality is code reviews. We use an online collaborative review tool, and it generally takes a few days to finish one. In the meantime, you start work on your next change, going back as needed to fix the defects found in the review. Turns out DVCS is useful in this situation as well, because we can create a new branch for our work as soon as the review is started, like so:
In conclusion, distributed version control offers many flexible ways to increase build stability through more local control and judicious design of a branching model. What branching models have you found to be successful?