Unix Spiders: November 2010

I attended a very interesting talk on Thursday 25th November by Dan North of DRW Trading entitled “Is DVCS ready for the enterprise”. There are some important trends in open source software configuration management/source code control (SCC) that I would like to highlight. Here are my thoughts from the talk.

There is increasing interest in Distributed Version Control Systems (DVCS), in both the commercial world and in open source software development. Dan suggested that a more interesting question to ask is ‘Is the enterprise ready for DVCS?’ His view in the talk (my interpretation) is that DVCS is ready for the enterprise; the open question is whether there are work patterns in the enterprise that can use it.

The initial problem he outlined is that more and more code is being version controlled, by much bigger teams spread over a wider geographical area. Developers can be on the same floor or in the same building, or on different continents and in different time zones. Put simply, the centralised model of SCC just isn’t working any more.

He provided a brief history of SCC and showed the development of open source tools:

1972: SCCS – the first SCC tool (file oriented)
1982: RCS – introduced the concept of multiple file access
1986: CVS – built on RCS by providing concurrent access for distributed users
2001: Subversion – repository oriented.

The key issue here is the problem of file-oriented SCC methods (the focus up to CVS). CVS tags all files. This leads to various problems, including tagging new versions (it’s all or nothing for all files) and problems with atomicity (file-oriented methods are problematic for resolving conflicts). Subversion tries to get round the problem by focusing on the repository level, and tagging only at that level.

There are a number of commercial players who are better able to tackle these problems, including ClearCase, StarTeam, Perforce and VSS. There were some strong views from the floor on the ability of these systems to tackle the problems above, and Dan agreed, but pointed out that his focus was on OS SCC systems. However, all the systems mentioned so far have a centralised file base.

Dan mentioned some alternative systems which are distributed or peer to peer (p2p) based:

1997 – Code Co-op
2001 – GNU Arch (now defunct)
2003 – Darcs (Haskell based – for reasoning on ChangeSets)

However, everything changed in 2005 with the ‘BitKeeper’ incident. BitKeeper was a proprietary but free program which was used by the Linux team for SCC. BitMover (the company which developed BitKeeper) decided to end free access and started to charge for their product. This led directly to the Linux community, spearheaded by Linus Torvalds, developing a DVCS called ‘git’. Concurrently another group started another DVCS called ‘Mercurial’. Very much a case of ‘Scratching an Itch’, something that Feller and Fitzgerald pointed out in their 2002 book when noting that CVS was rather behind the times in terms of SCC, particularly with respect to proprietary systems. In contrast, Dan’s view is that proprietary systems did not take notice of these new developments, and OS now has the lead – this tends to happen if OS developers get the bit between their teeth and people annoy them! Things can change very quickly in the OS world.

More interestingly OS development is now being driven by two new websites:

GitHub [https://github.com/] – based on ‘git’
Bitbucket [http://bitbucket.org/] – based on ‘Mercurial’

Dan contrasted these websites with code repositories that have been around for a while, FreshMeat and SourceForge. It was his view that FreshMeat overtook SourceForge, that these new sites will in turn take over from SourceForge, and that they will be the focus of OS software development in the future. The most interesting thing about GitHub and Bitbucket is that they are based on the social networking model, as embodied by sites such as Facebook and LinkedIn. So while GitHub and Bitbucket have similarities to FreshMeat and SourceForge, there are some very important differences.

In order to illustrate the significance of these developments, Dan contrasted the centralised model of SCC with the p2p model (DVCS).

The centralised model consists of a Hub and Spokes. The master copy of the SCC data is held on the Hub, while the spokes represent the local copies held by developers on their computers. The model is very much ‘pull’ rather than ‘push’. Working copies are shared via the Hub (Master). Nothing gets past the master! In the centralised model it is easier to control:

User access: through the HTTP protocol
Build and Release: single point of access

In contrast, the p2p model makes all copies (repositories) equal, and copies are just a series of clones. You either have a clone or you don’t – each developer has their own clone with all information provided, and full access rights to the code base. Operations that would need the network in a centralised system are local, and therefore faster. Branching and merging – a big issue in SCC – are also local. ‘Diffing’ on copies provides more information. Committing is done locally.

In terms of data, repositories in the p2p model share ‘ChangeSets’, rather than changes to individual files and directories as in the centralised model (CVS has a tree, and changes between trees are what drive that SCC system). What is significant here is that publishing is decoupled from committing. This raises the question – isn’t this a recipe for chaos?
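To make the decoupling of committing from publishing concrete, here is a minimal toy sketch in Python. It is not how git or Mercurial are implemented; the `Repository` class, its methods and the repository names are all hypothetical and invented purely for illustration. Every clone carries the full history, a commit touches only the local clone, and sharing is a separate, explicit step.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """A toy repository: every clone carries the full history."""
    name: str
    history: list = field(default_factory=list)  # ordered changeset descriptions

    def clone(self, name):
        # A clone is a complete, independent copy of the history.
        return Repository(name, list(self.history))

    def commit(self, changeset):
        # Committing only touches the local repository; no network is involved.
        self.history.append(changeset)

    def pull_from(self, other):
        # Sharing is a separate, explicit step: copy over any
        # changesets that this repository does not have yet.
        for changeset in other.history:
            if changeset not in self.history:
                self.history.append(changeset)


shared = Repository("shared", ["initial import"])
alice = shared.clone("alice")

alice.commit("fix parser bug")        # committed locally only
before_publish = list(shared.history)  # the shared copy has not changed yet

shared.pull_from(alice)               # the change is now published
```

Until `pull_from` is called, the shared repository knows nothing about the commit in the clone, which is exactly the property that prompts the ‘recipe for chaos’ question.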

There are a number of issues to consider here:

Build must be deterministic and repeatable
Configuration management audit and traceability: who did what, when?
Organisational structure of the team: the central model allows access to trees, so various parts of a team can access their part of the tree, but you can’t do this in the p2p model. How do you cope with team structure in DVCS?

How do teams integrate their code to create a coherent software build? The modular style of development is the key to success in any software project, for example by allowing specialists to concentrate on their own particular area of concern. This development style is very prominent in OS projects. One way to do this is to centralise by convention. Dan gave the example of the Linux kernel, where thousands of developers are involved. This project very much drove the development of git and Mercurial. One repository is ‘Canonical’: it is the single master copy, and only one person can commit to it. Other users can only download this Canonical copy. Different specialist groups have their own Canonical repository, and a hierarchy of commits is created, with different levels of peer review appropriate for each level. All releases come from the central repository. Commits are done locally, whereas synchronisation is done globally. Centralising by convention makes synchronisation much easier. The Bazaar model of development is clearly influential in the p2p model of SCC.
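The idea of centralising by convention can be sketched as a small hierarchy of repositories, each with a single maintainer who alone integrates changes. This is only an illustration under my own assumptions: the `Repo` class, the `review_chain` helper and the names ‘lead’ and ‘net-maintainer’ are hypothetical and do not describe the kernel’s actual process or any tool’s API.

```python
class Repo:
    """A toy repository that only its maintainer may integrate into."""

    def __init__(self, name, maintainer, upstream=None):
        self.name = name
        self.maintainer = maintainer
        self.upstream = upstream      # the next level up, or None for the top
        self.changesets = []

    def accept(self, changeset, who):
        # The restriction is a convention enforced here by a simple check.
        if who != self.maintainer:
            raise PermissionError(f"{who} cannot integrate into {self.name}")
        if changeset not in self.changesets:
            self.changesets.append(changeset)


def review_chain(repo):
    """List the maintainers a change passes on its way to the top."""
    chain = []
    while repo is not None:
        chain.append(repo.maintainer)
        repo = repo.upstream
    return chain


canonical = Repo("canonical", maintainer="lead")
networking = Repo("networking", maintainer="net-maintainer", upstream=canonical)

# A change is reviewed and integrated one level at a time.
networking.accept("add driver", who="net-maintainer")
canonical.accept("add driver", who="lead")

chain = review_chain(networking)
```

Nothing in the tooling forces this shape; every repository is technically equal, and the hierarchy exists only because the team agrees on who integrates where.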

Dan talked about the fundamental difference between the centralised and p2p models in terms of the data they process. Centralised SCC systems (such as CVS) are file oriented, whereas p2p DVCS systems (such as git) are ChangeSet oriented.

In the ChangeSet model, there is no concept of a file. Renames and deletes on files and directories are no longer special. Only changes are recorded – this decoupling is important. The advantage of ChangeSets is that we can use ChangeSet Algebra on them to find the differences and act on them, e.g. to restore parts of the code previously deleted. With ChangeSets only the changes are stored, which is much more efficient than recording changes on individual files and directories (in the CVS tree, for example): it saves a substantial amount of storage space and is easier to reason about. This, then, is the big advantage of ChangeSets, and hence of the p2p model.
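A rough sketch of what ‘ChangeSet Algebra’ might look like is given below. It is my own simplification, not the representation used by Darcs, git or Mercurial: a changeset is modelled as a mapping from a path to an (old, new) pair, where `None` means ‘absent’, and the function names `apply_changeset` and `invert` are hypothetical. Paths are kept only to make the example readable.

```python
def apply_changeset(state, changeset):
    """Return a new state with the changeset applied."""
    new_state = dict(state)
    for path, (old, new) in changeset.items():
        # A changeset only applies if the state matches what it expects.
        if new_state.get(path) != old:
            raise ValueError(f"changeset does not apply cleanly at {path}")
        if new is None:
            new_state.pop(path, None)   # content removed
        else:
            new_state[path] = new       # content added or changed
    return new_state


def invert(changeset):
    """The inverse changeset swaps old and new for every entry."""
    return {path: (new, old) for path, (old, new) in changeset.items()}


code = "def helper(): ..."
state = {"util.py": code}

# A delete is just a change from some content to nothing.
delete = {"util.py": (code, None)}
after_delete = apply_changeset(state, delete)

# Applying the inverse restores the previously deleted code.
restored = apply_changeset(after_delete, invert(delete))

# A rename is not special either: one removal plus one addition.
rename = {"util.py": (code, None), "helpers.py": (None, code)}
after_rename = apply_changeset(state, rename)
```

Because deletes and renames are ordinary changes, undoing one is the same operation as undoing any other change: apply its inverse.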

In order to illustrate the advantage of the p2p model and its data storage model, Dan talked about the incremental merge problem. In centralised systems, incremental merges can cause delete commands (for example) to be issued more than once, leading to inconsistencies in the repository. Various members of the audience asserted that this has not been a problem for proprietary systems, but Dan countered that CVS and Subversion did not deal with the problem, and that git and Mercurial are significant advances over them.
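One way to picture the incremental merge problem is to compare replaying every operation on each merge with remembering which changesets have already been applied. The sketch below is a toy under my own assumptions and does not describe how any particular tool tracks merges; `naive_replay`, `TrackingRepo` and the changeset identifiers are hypothetical.

```python
def naive_replay(files, operations):
    """Replay every operation on each merge, with no memory of earlier merges."""
    for _cs_id, op, path in operations:
        if op == "delete":
            files.remove(path)   # raises KeyError if the file is already gone
        elif op == "add":
            files.add(path)


class TrackingRepo:
    """Remember applied changeset ids so that repeated merges are safe."""

    def __init__(self, files):
        self.files = set(files)
        self.applied = set()

    def merge(self, changesets):
        applied_now = []
        for cs_id, op, path in changesets:
            if cs_id in self.applied:
                continue         # already merged earlier; do not repeat it
            if op == "delete":
                self.files.remove(path)
            elif op == "add":
                self.files.add(path)
            self.applied.add(cs_id)
            applied_now.append(cs_id)
        return applied_now


# The same branch is merged twice, growing by one changeset in between.
branch = [("c1", "delete", "old.c")]
repo = TrackingRepo({"main.c", "old.c"})
first = repo.merge(branch)
branch.append(("c2", "add", "new.c"))
second = repo.merge(branch)

# The naive approach issues the delete a second time and fails.
naive_files = {"main.c", "old.c"}
naive_replay(naive_files, branch[:1])
try:
    naive_replay(naive_files, branch)
    replay_failed = False
except KeyError:
    replay_failed = True
```

In the tracking version the second merge applies only the new changeset, so the delete is never issued twice.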

Dan also talked about the difference in release practice between centralised and p2p systems: the former tend to branch per release, whereas the latter branch per feature. He talked very briefly about migrating to DVCS.

Dan outlined some barriers to adoption:

git has a significant learning curve
it’s easy to overlook the synchronisation issue
it is much harder to enforce central control.

Dan concluded that the technology works and DVCS is most definitely ready for the enterprise. I found this talk extremely illuminating, and will be revising my session on OSS tools accordingly.

Update: Audio of the talk is available on the BCS streaming server.

References

J. Feller & B. Fitzgerald - Understanding Open Source Software Development, Addison-Wesley, 2002.

Relevance feedback is a very useful tool to either modify or expand queries according to user needs, or to provide an example with which to start a search. The former tends to be used in text retrieval, the latter in image retrieval.

An example on the web is TinEye, provided by a company by the name of Idée.

It works by finding an exact copy of an image you specify, either via upload or via a link. It does not find near matches: the result is binary, either the image matches or it doesn't. Some small differences are allowed; for example, searching for the Mona Lisa in curlers will bring up the original version of the Mona Lisa. One interesting part of the service is that if you provide a web page, it will load the images and allow you to click on them to conduct a search. The service is useful for tasks which require known item search, e.g. I've got this image, has someone used it before on the web, and I need the original.

(h/t: Tanya)

Monday, 29 November 2010

Monday Blues Cure

Sunday, 28 November 2010

Current trends in Software configuration Management

Thursday, 25 November 2010

Image Search by Example

Monday, 22 November 2010

Monday Blues Cure

Monday, 15 November 2010

Jamendo free music website

Monday Blues Cure

Monday, 8 November 2010

Monday Blues Cure

Monday, 1 November 2010

Monday Blues Cure
