At work, we spent some months wringing our hands over how to harness our increasing number of racked machines in a semi-sane way. In our case, since we work in the HPC arena, we needed a batch queuing system.
Ten years ago, I wrote and maintained such a system at Sun. It managed the execution of jobs on about 2,200 computers, and had a user community of several hundred. Although it was a fine piece of software, Sun never did much with it, even though for a while it was substantially superior to Platform’s LSF, and could have made a modestly successful product.
Instead, a few years after I left, Sun bought a company called Gridware, and tried to sell that company’s similar product, which Sun named GridEngine. Sun’s timing was poor, as the bottom fell out of the market for their hardware around the time they acquired Gridware (in 2000), and market acceptance of GridEngine was probably not helped by its utter awfulness as a piece of software.
Sun ended up open-sourcing GridEngine; it has seen some modest success in this capacity, primarily because its competition is even worse.
For a long time prior to the appearance of GridEngine, the only free batch system was a package called PBS. Development of PBS forked several times, and all of its variants have long been notoriously difficult to set up and keep running. About the best of the forks is named Torque.
Even Torque is impressively painful to deal with, however, to the extent that I despaired of it after a mere hour of reading the installation and configuration instructions. It was clear that it would annoy me endlessly with its sheer clunkiness, and prove far more trouble than it was worth for a small computing facility.
A little under two years ago, Robert brought up GridEngine on a small cluster, for our hardware team to run simulations and verification jobs. Although it worked (and in fact they still use it), he was scarred enough by the experience that when I started to think about deploying a batch system more widely, he encouraged me to look for alternatives.
Whereupon I came across SLURM, a resource manager developed at LLNL. SLURM is under active development, is easy to use, works quite well, and, most importantly to your harried author, hasn’t been a nightmare to configure or manage. (Strong praise, that.)
I would rank SLURM as the best of the three open-source batch systems available, by rather a large margin. It has some fragility problems and peculiarities in the way it works, but they’re easy to forgive given its generally sane construction.