There is increasing interest in Distributed Version Control Systems (DVCS), both in the commercial world and in open source software development. Dan suggested that a more interesting question to ask is ‘Is the enterprise ready for DVCS?’ His view in the talk (my interpretation) is that DVCS is ready for the enterprise; the open question is whether there are work patterns in the enterprise that can use DVCS.
The initial problem he outlined is that more and more code is being version controlled, by much bigger teams spread over a wider geographical area. Developers can be on the same floor or in the same building, or on different continents and in different time zones. Put simply, the centralised model of SCC just isn’t working any more.
He provided a brief history of SCC and showed the development of open source tools:
- 1972: SCCS – the first SCC tool (file oriented)
- 1986: RCS – introduced the concept of multiple file access
- 1986: CVS – built on RCS by providing concurrent access for distributed users
- 2001: Subversion – repository oriented.
There are a number of commercial products that are better able to tackle these problems, including ClearCase, StarTeam, Perforce and VSS. There were some strong views from the floor on the ability of these systems to tackle the problems above, and Dan agreed, but pointed out that his focus was on open source (OS) SCC systems. However, all the systems mentioned so far have a centralised file base.
Dan mentioned some alternative systems which are distributed or peer to peer (p2p) based:
- 1997 – Code Co-op
- 2001 – GNU Arch (now defunct)
- 2003 – Darcs (Haskell based – for reasoning on ChangeSets)
More interestingly, OS development is now being driven by two new websites:
- GitHub [https://github.com/] – based on ‘git’
- Bitbucket [http://bitbucket.org/] – based on ‘Mercurial’
In order to illustrate the significance of these developments, Dan contrasted the centralised model of SCC with the p2p model (DVCS).
The centralised model consists of a Hub and Spokes. The master copy of the SCC data is held on the Hub, while the spokes represent the local copies held by developers on their computers. The model is very much ‘pull’ rather than ‘push’. Working copies are shared via the Hub (Master). Nothing gets past the master! In the centralised model it is easier to control:
- User access: through the HTTP protocol
- Build and Release: single point of access
In terms of data, repositories in the p2p model share ‘ChangeSets’, rather than changes to individual files and directories as in the centralised model (CVS has a tree, and changes between trees are what drive that SCC system). What is significant here is that publishing is decoupled from committing. This raises the question: isn’t this a recipe for chaos?
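To make the decoupling of committing and publishing concrete, here is a minimal sketch in Python. It is a toy model, not how git or Mercurial are implemented: the `Repository` and `ChangeSet` classes and the `commit` and `push` method names are all invented for illustration, and changeset contents are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChangeSet:
    """A named unit of change; its contents are omitted in this toy model."""
    ident: str
    message: str


@dataclass
class Repository:
    """A repository holds an ordered history of changesets (hypothetical)."""
    name: str
    history: list = field(default_factory=list)

    def commit(self, ident: str, message: str) -> ChangeSet:
        # Committing only touches this repository; no peer is contacted.
        cs = ChangeSet(ident, message)
        self.history.append(cs)
        return cs

    def push(self, other: "Repository") -> int:
        # Publishing is a separate step: send the changesets the peer lacks.
        known = {cs.ident for cs in other.history}
        missing = [cs for cs in self.history if cs.ident not in known]
        other.history.extend(missing)
        return len(missing)


laptop = Repository("laptop")
shared = Repository("shared")

laptop.commit("c1", "add parser")
laptop.commit("c2", "fix parser bug")
committed_but_unpublished = len(shared.history)  # nothing has been shared yet

published = laptop.push(shared)  # both changesets are published in one step
```

In a centralised system the two steps happen together, because a commit is by definition a write to the Hub. Here a developer can accumulate any number of commits before anyone else sees them, which is exactly where the ‘recipe for chaos’ worry comes from.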
There are a number of issues to consider here:
- Build must be deterministic and repeatable
- Configuration management audit and traceability: who did what, when?
- Organisational structure of the team: the central model allows access to trees, so various parts of a team can access their part of the tree, but you can’t do this in the p2p model. How do you cope with team structure in DVCS?
Dan talked about the fundamental difference between the centralised and p2p models in terms of the data they process. Centralised SCC systems (such as CVS) are file oriented, whereas p2p DVCS systems (such as git) are ChangeSet oriented.
In the ChangeSet model, there is no concept of a file. Renames and deletes on files and directories are no longer special. Only changes are recorded, and this decoupling is important. The advantage of ChangeSets is that we can use ChangeSet algebra to find the differences between them and act on those differences, e.g. to restore parts of the code previously deleted. Because only the changes are stored, this is much more efficient than storing changes to individual files and directories (in the CVS tree, for example): it saves a substantial amount of storage space and is easier to reason about. This, then, is the big advantage of ChangeSets, and hence of the p2p model.
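As a rough illustration of what ‘ChangeSet algebra’ might look like, the sketch below treats a changeset as a value that can be applied and inverted. This is my own simplification, not something shown in the talk: it still keys changes by path for readability, and the names `apply`, `invert`, `Tree` and `ChangeSet` are hypothetical.

```python
from typing import Dict, Optional, Tuple

Tree = Dict[str, str]
# path -> (content before, content after); None means "not present"
ChangeSet = Dict[str, Tuple[Optional[str], Optional[str]]]


def apply(tree: Tree, cs: ChangeSet) -> Tree:
    """Return a new tree with the changeset applied; the input is untouched."""
    result = dict(tree)
    for path, (before, after) in cs.items():
        if result.get(path) != before:
            raise ValueError(f"changeset does not apply cleanly at {path}")
        if after is None:
            result.pop(path, None)
        else:
            result[path] = after
    return result


def invert(cs: ChangeSet) -> ChangeSet:
    """The inverse changeset swaps 'before' and 'after' for every entry."""
    return {path: (after, before) for path, (before, after) in cs.items()}


tree = {"README": "hello", "util.py": "def helper(): pass"}

# A delete is just a change from some content to nothing...
delete_util = {"util.py": ("def helper(): pass", None)}
without_util = apply(tree, delete_util)

# ...so restoring deleted code is applying the inverse changeset.
restored = apply(without_util, invert(delete_util))

# A rename is likewise an ordinary changeset: remove here, add there.
rename_readme = {"README": ("hello", None), "README.txt": (None, "hello")}
renamed = apply(tree, rename_readme)
```

In this model deletes and renames need no special handling, since both are ordinary changes that can be inverted like any other.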
In order to illustrate the advantage of the p2p model and its data storage model, Dan talked about the incremental merge problem. In centralised systems, incremental merges can cause commands (deletes, for example) to be initiated multiple times, leading to inconsistencies in the repository. Various members of the audience asserted that this has not been a problem for proprietary systems, but Dan countered that CVS and Subversion did not deal with the problem, and that git and Mercurial are significant advances over them.
Dan also talked about the difference in release practice between centralised and p2p systems: the former tend to branch per release, whereas the latter branch per feature. He also talked very briefly about migrating to DVCS.
Dan outlined some barriers to adoption:
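One way to read the incremental merge problem is sketched below. This is my interpretation rather than an example from the talk, and it does not model how any real tool works internally: `naive_merge` replays every branch operation on each merge, while `tracked_merge` remembers which changeset identifiers have already been merged. Both function names and the operation tuples are invented for illustration.

```python
def naive_merge(trunk_files: set, branch_ops: list) -> set:
    """Replay every branch operation made since the branch point."""
    files = set(trunk_files)
    for _ident, op, path in branch_ops:
        if op == "delete":
            files.discard(path)
        elif op == "add":
            files.add(path)
    return files


def tracked_merge(trunk_files: set, branch_ops: list, merged: set) -> set:
    """Replay only operations whose identifier has not been merged before."""
    files = set(trunk_files)
    for ident, op, path in branch_ops:
        if ident in merged:
            continue  # already applied by an earlier merge
        if op == "delete":
            files.discard(path)
        elif op == "add":
            files.add(path)
        merged.add(ident)
    return files


branch_ops = [("b1", "delete", "old.txt")]
trunk = {"old.txt", "main.py"}

# First merge: both strategies delete old.txt.
merged_ids = set()
naive = naive_merge(trunk, branch_ops)
tracked = tracked_merge(trunk, branch_ops, merged_ids)

# Trunk later re-creates a file at the same path; the branch adds a new file.
naive.add("old.txt")
tracked.add("old.txt")
branch_ops.append(("b2", "add", "new.txt"))

# Second, incremental merge.
naive = naive_merge(naive, branch_ops)  # the delete is issued again
tracked = tracked_merge(tracked, branch_ops, merged_ids)  # only b2 is applied
```

After the second merge the naive strategy has silently removed the re-created `old.txt`, while the tracked strategy keeps it. Identifying changes by changeset rather than by file operation is what makes the second behaviour straightforward.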
- git has a significant learning curve
- it’s easy to overlook the synchronisation issue
- it is much harder to enforce central control.
Update: Audio of the talk is available on the BCS streaming server.