One of the particularly nice things about working with a distributed revision control tool these days is that I can sidestep the choice of winning tool. Thanks to Scott Chacon and Augie Fackler’s excellent hg-git extension, I can use Mercurial and collaborate almost seamlessly with git users. This is exactly what I did when working with Johan on the new I/O manager subsystem in GHC 7, and the experience was generally very smooth.

The only mild annoyance has been that I’d prefer to also not be forced to choose a hosting winner: although bitbucket is pretty good, github is currently far slicker, and has a much larger community of potential collaborators.

I’ve hosted most of my code on bitbucket for quite a while. Until this morning, I had a somewhat awkward way to mirror code to github. I just automated the problem away.

My automation scheme is implemented as a Mercurial hook. Here’s how I’ve enabled it in my $HOME/.hgrc file:

[hooks]
post-push = python:/home/bos/share/python/github_mirror.py:post_push

That github_mirror.py hook is very simple. Every time I push, it checks to see if I'm pushing to a bitbucket repository, and if so checks my local repo's .hg/hgrc file to see if I have a mirror on github. If I do, it pushes to github, too.

from mercurial import commands

def post_push(ui, repo, pats, opts, *args, **kwargs):
    dest = pats and pats[0]
    dest = ui.expandpath(dest or 'default-push', dest or 'default')
    if 'bitbucket.org' in dest:
        github = ui.config('paths', 'github')
        if github:
            return commands.push(ui, repo, github, **opts)
        ui.warn('no github mirror!?\n')

How do I tell Mercurial that I have a github mirror? In a repo's .hg/hgrc file, I have something like this (taken from a real repo):

[paths]
default = http://bitbucket.org/bos/pcap
default-push = ssh://hg@bitbucket.org/bos/pcap
github = git+ssh://git@github.com/bos/pcap.git

The github_mirror.py hook looks for that github key in the paths section of the file, and uses it if present.
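To make the lookup concrete, here is a minimal standalone sketch of the same decision in plain Python. It is not how the hook itself works: the hook asks Mercurial's ui object for the value, whereas this sketch reads the file text with the standard configparser module, which only approximates Mercurial's config syntax. The function name mirror_target is made up for illustration.

```python
import configparser


def mirror_target(dest, hgrc_text):
    """Return the github mirror URL to push to, or None if there is none.

    dest is the URL being pushed to; hgrc_text is the text of a repo's
    .hg/hgrc file. Hypothetical helper, not part of Mercurial.
    """
    # Only pushes that go to bitbucket are mirrored.
    if "bitbucket.org" not in dest:
        return None
    # interpolation=None keeps any '%' characters in URLs literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(hgrc_text)
    # A missing [paths] section or github key both yield None.
    return parser.get("paths", "github", fallback=None)
```

Given the .hg/hgrc shown above and a push to the default-push URL, this returns the git+ssh URL stored under the github key.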

Over the past couple of years, since David Roundy handed over control of the darcs revision control system to a talented team of developers, it has come a long way in stability and performance.

I began using darcs essentially out of necessity, simply because it was the revision control system of choice for much of the Haskell community. I have always found my use of it to be somewhat clunky, and lately I’ve been migrating most of my remaining Haskell projects away from it.

In day to day use, in a single-developer repository on a single machine, darcs has always been pretty decent, and in fact a few aspects of its user interface have been influential. Its record command, for instance, is sufficiently nifty that I wrote the record extension for Mercurial. I'm fairly sure that my enthusiasm for record also led me to ask for the feature that became the interactive change selection feature of the marvelous TortoiseHg GUI. Even git acquired a similar change selection feature at some point.

However, for a team of developers, or even a single individual working across several machines, darcs isn't so great. It presents two challenges that I find troublesome: dealing with concurrent conflicting changes to files is a mess, and its much-touted patch theory often makes poor engineering sense.

My idea of a good merge process is the one I get with Mercurial (git and svn can be configured to act similarly, but don't present nearly as nice a default experience). In brief: I start a merge by hand, most of it proceeds automatically, and I get dropped into a nice GUI whenever something funky needs fixing up. With darcs, dealing with conflicts is messy, confusing, and something I usually get wrong. I'm not alone in this; several of my collaborators dread submitting patches to GHC because it's so hard to merge changes against a fast-moving upstream. After enough bad experiences, I have a subconscious view of a darcs merge as a way to lose bits of work.

As for patch theory, my principal point of discomfort with it is that I have little idea what was in someone's repository when they created a patch. People grouse about the fact that the history of a Mercurial project often contains a lot of merges, but those merges mean that I can reconstruct what another developer was able to see when they were writing code. That's a long way to say "merges are debugging gold", especially when coupled with liberal use of the bisect command. The rebase command provided by git and Mercurial provides a nice balance: history is explicit by default, but can be tidied if necessary.

I am not by any means trying to belittle the work of the darcs developers. Darcs is a good piece of software, and I appreciate the work its team has put into it over the past few years. I've soldiered along with it for long enough that I could obviously do so indefinitely, but the experience I have with other tools is now sufficiently better that I simply don't want to use darcs. For day to day work, TortoiseHg is the best GUI I've had the good fortune to use with any revision control tool. It makes committing changes and browsing history a snap. And for collaboration, of course both github and bitbucket are fabulous. I particularly appreciate the hg-git plugin, which lets me collaborate almost seamlessly with git users, while enjoying Mercurial's UI. Unfortunately, the darcs community has nothing on a comparable level of quality for either fluid use or collaboration.

Ersin Er wrote a brief blog post about handling the Turkish language in Haskell. Because Turkish uses a character set that mostly looks familiar to Westerners, it is notorious for its ability to trip up the unwary programmer (see examples in PHP and PostgreSQL).

import Data.Text (pack, unpack)
import Data.Text.ICU (LocaleName(Locale), toLower)

main = do
  let trLocale = Locale "tr-TR"
  let upStr = "ÇIİĞÖŞÜ"
  let lowStr = unpack $ toLower trLocale $ pack upStr
  putStrLn ("toLower " ++ upStr ++ " gives " ++ lowStr)

His example is quite nice, but we can write a more compact version of his code using a few handy features of the text and text-icu packages:

  • In the text-icu library, we use the LocaleName type to describe the locale in which we want a function to operate. This type is an instance of the IsString class, so if we enable the OverloadedStrings language feature, we can write plain "tr-TR" to specify a Turkish locale.

  • The Text type is also an instance of the IsString class, so we can write a literal string like "foo" and the compiler will infer the correct type for it.

  • The Data.Text.IO module contains functions for performing locale-sensitive I/O using Text values.

This combination of features can let us write a less cluttered program, following the dictum that simple things should be simple:

{-# LANGUAGE OverloadedStrings #-}
import Data.Text.IO as T
import Data.Text.ICU as T (toLower)

main = do
  let upper = "ÇIİĞÖŞÜ"
      lower = T.toLower "tr-TR" upper
  mapM_ T.putStr ["toLower ", upper, " gives ", lower, "\n"]

I've intentionally kept the number of lines the same to preserve clarity, but there are a few advantages to the rewrite:

  • Less clutter, more speed: we don't need to explicitly pack or unpack Text values to or from String values.

  • Performance: we're not performing I/O on String values. This would be a big deal if we were writing a real application: I/O with Text is much faster than with String.

  • Putting inference to work: the compiler correctly infers the type of "tr-TR" to be a LocaleName, and of the strings at the end to be Text, so we don't need to be so explicit.

Oh, and we still give the right answer (look carefully at upper and lower case dotted and dotless "I"):

toLower ÇIİĞÖŞÜ gives çıiğöşü
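For readers who don't have text-icu handy, the dotted and dotless "I" problem can be shown with a small Python sketch. Python's built-in str.lower() is not locale-aware, so it maps "I" to "i" whatever the language. The turkish_lower helper below is a hypothetical, hand-rolled stand-in that covers only the two Turkish special cases; it is not a substitute for ICU's locale tables.

```python
# The two letters that Turkish treats differently from most Latin-script locales.
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı


def turkish_lower(text):
    """Lowercase text using the Turkish rules for I and İ.

    Hypothetical helper: only the two Turkish-specific mappings are
    special-cased, everything else is left to the default str.lower().
    """
    # Apply the Turkish mappings first, so the locale-independent lower()
    # never sees an uppercase I or İ.
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()
```

Calling turkish_lower("ÇIİĞÖŞÜ") gives the same string as the output above, while plain "ÇIİĞÖŞÜ".lower() does not.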

The full documentation to the text and text-icu libraries is a little difficult to read on Hackage (in fact, the text-icu API docs are completely missing), so here are links:

I spent some time today trying to talk to a MySQL database server from a piece of middleware I'm writing in Haskell. You might think that talking to a database server would be easy, but it turned out to be quite a bother.

Both of the major MySQL bindings, HDBC-mysql and HDBC-odbc, use the libmysqlclient C library behind the scenes. With GHC's unthreaded runtime, which is still the default, an application using either will work fine. However, my middleware app is highly concurrent and uses software transactional memory (STM) to manage some shared state, and I have to use the threaded runtime. This is where my troubles began.

The symptom I observed was that I couldn't even connect to a database:

SqlError {
  seState = "", 
  seNativeError = 2003, 
  seErrorMsg = "Can't connect to MySQL server on 'xxxxx' (4)"}

After enough years of dealing with MySQL, you pick up some useful nuggets such as "the number in parentheses at the end of certain kinds of error message is a Unix errno value" (the library doesn't provide any other way to see what errno caused a failure, amusingly enough). The number 4 is EINTR, indicating that a system call was being interrupted.
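That nugget is easy to turn into a few lines of code. The following Python sketch pulls the trailing number out of a message and looks it up in the standard errno table. The helper names are invented for illustration, and the assumption that the parenthesised number is always an errno holds only for the kinds of message described above.

```python
import errno
import os
import re


def errno_from_message(message):
    """Return the number in trailing parentheses, or None if there is none."""
    match = re.search(r"\((\d+)\)\s*$", message)
    return int(match.group(1)) if match else None


def describe_errno(code):
    """Format an errno value as 'NAME: description' for the current platform."""
    name = errno.errorcode.get(code, "unknown")
    return "%s: %s" % (name, os.strerror(code))
```

Fed the seErrorMsg string above, errno_from_message returns 4, which the errno module names EINTR on both Linux and Mac OS X.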

I split my development time between a Mac and a Linux laptop, and today's hacking was on a Mac, so I fired up dtruss to see what was wrong:

dtruss -b128m myapp

(I'd much rather have been using Linux here. dtruss is vastly inferior to strace, and in fact in its default configuration, it doesn't work at all! That -b128m is necessary to give its kernel component enough of a scratchpad that it won't run out of space while sampling.)

The interrupted system call was connect, and sure enough, reading the library source code, we can see that the problem lies in the my_connect function:

/*
  If they passed us a timeout of zero, we should behave
  exactly like the normal connect() call does.
*/

if (timeout == 0)
  return connect(fd, (struct sockaddr*) name, namelen);

The comment is more or less accurate, but the library should be more careful in its use of the connect function: the caller of my_connect doesn't check for EINTR, and so the connection will fail if the thread receives a signal.

Why is the thread receiving a signal in the first place, though? GHC's threaded RTS sets up either a SIGALRM or SIGVTALRM signal to perform some internal book-keeping at a fairly high frequency, and it's the arrival of this signal that interrupts connect. Failure to check for EINTR and retry is a widespread problem in C code that uses system calls directly.
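The general shape of "check for EINTR and retry" looks roughly like the Python sketch below. It is an illustration of the pattern, not a patch for libmysqlclient. The retry_on_eintr name and the max_attempts limit are made up here, and Python 3.5 and later already retry most interrupted system calls on their own. Note too that connect is a special case: POSIX lets an interrupted connect carry on in the background, so simply calling it again may not be enough.

```python
import errno


def retry_on_eintr(call, *args, max_attempts=100):
    """Call call(*args), retrying while it fails with EINTR.

    Hypothetical helper. max_attempts is an arbitrary safety limit so
    that a signal storm cannot make this loop forever.
    """
    for _ in range(max_attempts):
        try:
            return call(*args)
        except InterruptedError:
            # InterruptedError is the OSError subclass for errno EINTR:
            # a signal arrived mid-call, so just try again.
            continue
    raise InterruptedError(
        errno.EINTR, "still interrupted after %d attempts" % max_attempts)
```

Any other exception passes straight through, so only the interrupted case is retried.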

To work around this, I wrote a simple module that masks the RTS signals that the MySQL client library fails to handle, then performs an action. It ensures that it's running in a bound thread (GHC terminology for a lightweight thread that's tied to a heavyweight system thread) for the duration of the action.

{-# LANGUAGE EmptyDataDecls, ForeignFunctionInterface #-}

module RTSHack (withRTSSignalsBlocked) where

import Control.Concurrent (runInBoundThread)
import Control.Exception (finally)
import Foreign.C.Types (CInt)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (Ptr, nullPtr)
import Foreign.Storable (Storable(..))

#include <signal.h>

withRTSSignalsBlocked :: IO a -> IO a
withRTSSignalsBlocked act = runInBoundThread . alloca $ \set -> do
  sigemptyset set
  sigaddset set (#const SIGALRM)
  sigaddset set (#const SIGVTALRM)
  pthread_sigmask (#const SIG_BLOCK) set nullPtr
  act `finally` pthread_sigmask (#const SIG_UNBLOCK) set nullPtr

data SigSet

instance Storable SigSet where
  sizeOf _ = #{size sigset_t}
  alignment _ = alignment (undefined :: Ptr CInt)

foreign import ccall unsafe "signal.h sigaddset" sigaddset
  :: Ptr SigSet -> CInt -> IO ()

foreign import ccall unsafe "signal.h sigemptyset" sigemptyset
  :: Ptr SigSet -> IO ()

foreign import ccall unsafe "signal.h pthread_sigmask" pthread_sigmask
  :: CInt -> Ptr SigSet -> Ptr SigSet -> IO ()
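The same block-then-restore idea can be sketched in Python, whose standard signal module exposes pthread_sigmask on Unix. This is only an analogy to the module above, not a translation of it. The signals_blocked and currently_blocked names are invented, and unlike the Haskell code, this version puts back whatever mask was in effect before rather than unconditionally unblocking the two signals.

```python
import signal
from contextlib import contextmanager

# The two timer signals named above.
RTS_SIGNALS = frozenset({signal.SIGALRM, signal.SIGVTALRM})


@contextmanager
def signals_blocked(signals=RTS_SIGNALS):
    """Block the given signals in the calling thread for the with-block."""
    # pthread_sigmask returns the mask that was in effect before the call.
    previous = signal.pthread_sigmask(signal.SIG_BLOCK, signals)
    try:
        yield
    finally:
        # Restore the earlier mask even if the body raised an exception.
        signal.pthread_sigmask(signal.SIG_SETMASK, previous)


def currently_blocked():
    """Return the calling thread's signal mask without changing it."""
    return signal.pthread_sigmask(signal.SIG_BLOCK, [])
```

Wrapping a call in "with signals_blocked():" plays the role that withRTSSignalsBlocked plays around an IO action.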

Here's a useful little tip if you need to use hg convert to generate a stripped-down copy of a Mercurial repository. For instance, maybe we have a tree that someone committed a large file to by accident, or perhaps someone accidentally checked some closed source code into an open source tree. If such a commit makes it into a busy repository, it can be a while before anyone notices. Worse, we can't necessarily expect everyone who's downstream of that repository to immediately drop everything they're doing and find a way to switch over to the freshly-scrubbed tree we want to publish.

There's a fairly painless way around this. Firstly, we'll need to use the filemap option to make hg convert strip out the files we want to lose from our new repository. Here's an idea of how that should work. To start off, we create a demo repository, named hg-100.

$ hg clone -r 100 http://www.selenic.com/repo/hg hg-100
$ ls -F hg-100
comparison.txt	hgweb.py     mercurial/  PKG-INFO  setup.py  tkmerge
hg		MANIFEST.in  notes.txt	 README    tests/

Next, we set up a filemap that eliminates all tests from the tree.

$ echo exclude tests > no-tests.map
$ hg convert --quiet --filemap no-tests.map hg-100 nukeme
$ hg --cwd nukeme update --quiet
$ ls -F nukeme
comparison.txt	hgweb.py     mercurial/  PKG-INFO  setup.py
hg		MANIFEST.in  notes.txt	 README    tkmerge

Now that we have run hg convert, it's changed the changeset ID of every commit that either contains a change to a file in tests or is a child of such a commit. Ouch, right? Haven't we now lost the ability to merge contaminated repositories with the scrubbed one?
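Why do the children change too? A changeset's ID is a hash that covers its parents' IDs as well as its own content, so rewriting one commit ripples down to every descendant. The toy Python model below shows the effect with a plain SHA-1 chain. It is a deliberately simplified stand-in, not Mercurial's real hashing scheme, and the function names are made up.

```python
import hashlib

NULL_ID = "0" * 40  # stand-in for "no parent"


def toy_changeset_id(parent_id, content):
    """Hash a changeset's content together with its parent's ID (toy model)."""
    data = (parent_id + "\n" + content).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def chain_ids(contents):
    """Return the IDs of a linear history, oldest changeset first."""
    ids = []
    parent = NULL_ID
    for content in contents:
        parent = toy_changeset_id(parent, content)
        ids.append(parent)
    return ids
```

Comparing chain_ids(["a", "b plus tests", "c"]) with chain_ids(["a", "b", "c"]) shows the first ID unchanged, and both the second and third IDs different, even though the third changeset's own content is identical.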

Fortunately, no. We could always pull future changes into hg-100, then rerun hg convert with the same --filemap option, and successfully scrub later changes, which we could then share with our collaborators.

The only trouble with this is that it relies on a non-version-controlled file named .hg/shamap in our nukeme tree. That file contains the mapping from source changeset ID to target changeset ID. This is what hg convert uses to tell whether a changeset has been seen before, and if so, what its new ID is. So if we lose the nukeme tree, we can't run hg convert again, and we're stuck, right? Well, once again, not necessarily.

A final trick up our sleeves is to run every instance of hg convert with --config convert.hg.saverev=True. This is documented in the output of hg help, but sadly not on the wiki page. This saves each source changeset ID in a special field in the corresponding changeset in the target repo.

$ rm -rf nukeme
$ hg convert --quiet --filemap no-tests.map --config convert.hg.saverev=True \
    hg-100 nukeme
$ hg --cwd nukeme tip --debug | grep 'extra:'
extra:       branch=default
extra:       convert_revision=526722d24ee5b3b860d4060e008219e083488356

The convert_revision data gives us enough to create a new .hg/shamap file whenever we need to. First, we create a style file to extract the necessary data from Mercurial.

$ cat >> pants.map << EOF
changeset = '{extras}\n'
extra = '{key} {value} {node}\n'
EOF

Then, in any clone of our converted repo, it becomes simple to regenerate the .hg/shamap file:

$ hg log -r0: --style pants.map | awk '/convert_revision/{print $2,$3}' \
    > $(hg root)/.hg/shamap
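If awk isn't to hand, the filtering step is simple enough to sketch in Python. The function below assumes its input is the output of the hg log command above, that is, lines of the form "key value node" as produced by the pants.map style file. The shamap_lines name is invented for illustration.

```python
def shamap_lines(log_output):
    """Turn 'key value node' lines into 'source target' shamap lines.

    Only convert_revision entries are kept: their value is the source
    changeset ID and the node is the corresponding target changeset ID.
    """
    entries = []
    for line in log_output.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0] == "convert_revision":
            entries.append("%s %s" % (fields[1], fields[2]))
    return entries
```

Joining the returned lines with newlines and writing them to .hg/shamap gives the same file as the awk pipeline.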

And finally: remember, kids, if you like this kind of information, not only is my Mercurial book free to read online, but if you buy a paper or ebook copy, all the royalties go to the Software Freedom Conservancy, whose work is so worthy of support you should buy five copies, not just one.
