Converting a Mercurial tree repeatedly with files removed

Here's a useful little tip if you need to use hg convert to generate a stripped-down copy of a Mercurial repository. For instance, maybe we have a tree that someone committed a large file to by accident, or perhaps someone accidentally checked some closed-source code into an open-source tree. If such a commit makes it into a busy repository, it can be a while before anyone notices. Worse, we can't necessarily expect everyone who's downstream of that repository to immediately drop everything they're doing and switch over to the freshly scrubbed tree we want to publish.

There's a fairly painless way around this. Firstly, we'll need to use the filemap option to make hg convert strip out the files we want to lose from our new repository. Here's an idea of how that should work. To start off, we create a demo repository, named hg-100.

$ hg clone -r 100 hg-100
$ ls -F hg-100
comparison.txt     mercurial/  PKG-INFO  tkmerge
hg  notes.txt	 README    tests/

Next, we set up a filemap that eliminates all tests from the tree.

$ echo exclude tests >
$ hg convert --quiet --filemap hg-100 nukeme
$ hg --cwd nukeme update --quiet
$ ls -F nukeme
comparison.txt     mercurial/  PKG-INFO
hg  notes.txt	 README    tkmerge

Now that we have run hg convert, it's changed the changeset ID of every commit that either contains a change to a file in tests or is a child of such a commit. Ouch, right? Haven't we now lost the ability to merge contaminated repositories with the scrubbed one?

Fortunately, no. We could always pull future changes into hg-100, then rerun hg convert with the same --filemap option, and successfully scrub later changes, which we could then share with our collaborators.
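As a rough sketch of that refresh cycle, here is a small shell function. The name rescrub and the filemap path passed to it are placeholders of mine, not anything hg provides, and it assumes the source clone has a default pull path configured and the convert extension is enabled.

```shell
# Hypothetical helper: refresh the source clone, then rerun the conversion.
# Arguments: source repo, scrubbed repo, path to the filemap.
rescrub() {
    src=$1 dst=$2 filemap=$3
    # Bring the contaminated clone up to date from its default path.
    hg --cwd "$src" pull --quiet || return 1
    # Rerun hg convert with the same filemap, so the new changesets
    # are scrubbed the same way as the earlier ones.
    hg convert --quiet --filemap "$filemap" "$src" "$dst"
}
```

We'd call it as something like rescrub hg-100 nukeme scrub.map, where scrub.map stands in for wherever the filemap was saved.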

The only trouble with this is that it relies on a non-version-controlled file named .hg/shamap in our nukeme tree. That file contains the mapping from source changeset ID to target changeset ID. This is what hg convert uses to tell whether a changeset has been seen before, and if so, what its new ID is. So if we lose the nukeme tree, we can't run hg convert again, and we're stuck, right? Well, once again, not necessarily.
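To make the format concrete: each line of the map holds a source ID, a space, and the matching target ID. The sketch below builds a tiny sample file with made-up, shortened IDs (real ones are 40 hex digits) and looks one up; lookup_target is a hypothetical helper, not part of hg.

```shell
# Made-up, shortened IDs purely for illustration.
cat > shamap.sample <<'EOF'
aaaa1111 bbbb2222
cccc3333 dddd4444
EOF

# Print the target ID recorded for a given source ID.
# Appending "" forces awk to compare as strings, so an all-digit
# ID is never treated as a number.
lookup_target() {
    awk -v src="$1" '($1 "") == (src "") { print $2 }' "$2"
}

lookup_target cccc3333 shamap.sample
```

An empty result means the source changeset has no entry in the map.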

A final trick up our sleeves is to run every instance of hg convert with --config convert.hg.saverev=True. This is documented in the output of hg help, but sadly not on the wiki page. This saves each source changeset ID in a special field in the corresponding changeset in the target repo.

$ rm -rf nukeme
$ hg convert --quiet --filemap --config convert.hg.saverev=True \
    hg-100 nukeme
$ hg --cwd nukeme tip --debug | grep 'extra:'
extra:       branch=default
extra:       convert_revision=526722d24ee5b3b860d4060e008219e083488356

The convert_revision data gives us enough to create a new .hg/shamap file whenever we need to. First, we create a style file to extract the necessary data from Mercurial.

$ cat >> << EOF
changeset = '{extras}\n'
extra = '{key} {value} {node}\n'
EOF

Then, in any clone of our converted repo, it becomes simple to regenerate the .hg/shamap file:

$ hg log -r0: --style | awk '/convert_revision/{print $2,$3}' \
    > $(hg root)/.hg/shamap
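Before trusting a regenerated map, it may be worth a quick sanity check. The function below is my own sketch (check_shamap is a made-up name): it assumes an hg-to-hg conversion, where both IDs on every line are 40 lowercase hex digits, and it reports any line that doesn't fit.

```shell
# Hypothetical check: succeed only if every line is "<40 hex> <40 hex>".
check_shamap() {
    awk '
        NF != 2 ||
        $1 !~ /^[0-9a-f]+$/ || $2 !~ /^[0-9a-f]+$/ ||
        length($1) != 40 || length($2) != 40 {
            bad++
            print "line " NR ": " $0
        }
        END { exit (bad > 0) }
    ' "$1"
}
```

Running check_shamap "$(hg root)/.hg/shamap" prints nothing and returns success when the file looks sane. A nonzero exit usually means the awk filter let through something other than a convert_revision line.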

And finally: remember, kids, if you like this kind of information, not only is my Mercurial book free to read online, but if you buy a paper or ebook copy, all the royalties go to the Software Freedom Conservancy, whose work is so worthy of support you should buy five copies, not just one.

3 comments on “Converting a Mercurial tree repeatedly with files removed”
  1. James T says:


    Thanks for this article and post. Part of it helped with my extraction of a project sub-folder from a large repo into its own repo. (Windows XP)
    However, it did not copy the source files – had to do it manually and then Commit/Add all files.
    On all the posts/articles I have seen the impression is given that hg update will copy the files from the source folder – but it didn’t in my case.
    Just wondering if this is an issue under Win XP.

    BTW: I have bought the book – good reference source! Seemed handier than printing out chunks of the on-line stuff.

  2. Gary Kramlich says:

    I’ve run through all of this, but can’t seem to push/pull between the two converted repos at all. Am I missing something?

  3. Rodica says:

    I moved to Mercurial mainly due to the fact that I didn’t want to move off of Google Code. Another reason was that I prefer MacHG over GitX (but now with Xcode 4 supporting git, that might have been even better). Finally, when I initially researched DVCSs, I got the impression that Mercurial was a bit easier to understand.
