Building Linux Beowulf Clusters
Mike Perry (mikepery@fscked.org)
Updated 09/26/00

This document aims to provide an overview of Beowulf clusters, from cluster design to cluster use and maintenance. It was written while I was under hire, and as such spends a good deal of time describing how to maintain and use the specific type of cluster I built for my employer.

Introduction

What is a Beowulf Cluster?

Essentially, any group of Linux machines dedicated to a single purpose can be called a cluster. To be called a Beowulf cluster, all you really need in addition is a central node that does some coordination (i.e., technically, even an IP Masq'ed network fits loosely into the usual Beowulf definition).

What can a Beowulf Cluster do?

A Beowulf cluster is capable of many things that are applicable in areas ranging from data mining to research in physics and chemistry, all the way to the movie industry. Essentially, anything that can perform several semi-independent jobs concurrently can benefit from running on a Beowulf cluster. There are two classes of these parallel programs.

Embarrassingly Parallel Computation

A Beowulf cluster is best suited for "embarrassingly parallel" tasks. In other words, those tasks that require very little communication between nodes to complete, so that adding more nodes results in a near-linear increase in computation with little extra synchronization work. Embarrassingly parallel programs may also be called linearly scalable: essentially, you get a linear performance increase for each additional machine, with no sign of diminishing returns.

Genetic algorithms (GAs) are one example of an application of embarrassingly parallel computation, since each added node can simply act as another "island" on which to evaluate the fitness of a population. Rendering an animation sequence is another, because each node can quite easily be given an individual image, or even sections of an image, to render, even using standard rendering programs.

Explicitly Parallel Computation

Also applicable are explicitly parallel programs, that is, programs that have explicitly coded parallel sections that can run independently. These differ from fully parallelizable programs in that they contain one or more interdependent serial or atomic components that must be synchronized across all instances of the parallel application. This synchronization is usually done through libraries such as MPI or PVM, or even extended versions of POSIX threads.

What can't a Beowulf Cluster do?

A common misconception is that there is a specific set of Beowulf software that you can run on a group of Linux machines, and voila, everything runs faster. This is not the case. While there is software to turn a cluster of Linux workstations into what seems like a single computer with a single process id space, this is impractical for a single-user setup, as even the most hardcore power user can only run (otherwise unparallelizable) CPU-intensive jobs for a short period of time before letting the machine fall idle. The way these systems are implemented also makes them impractical even for speeding up a simple "make -j".

So essentially, when thinking about implementing a Beowulf cluster, the questions to ask are: "Are lots of independent programs going to be running?", "Is the extra speed worth rewriting my programs? Are they even parallelizable?", and "Are my programs CPU bound?" All Beowulf configurations will slow down disk- or memory-bound applications.

Types of Beowulf Clusters

There are several ways to design a Beowulf. I'm going to try to break down the types by network, hardware, and software configuration, since you really can pick one from each. Under each section will be a quick blurb about the best price/performance technology to use. Please note that the terms I use here are my own, so I can refer to them easily. (Also, most of them I made up because they sound cool ;)

Classification by Network Architecture

For all cluster types, I would recommend switched 100BaseT as the price/performance sweet spot. Gigabit switches and cards are currently roughly 10 times more expensive than 100BaseT, and for embarrassingly parallel jobs you won't need the bandwidth, as your communication should be minimal. If you are implementing a MOSIX-based cluster, Gigabit networking might be something to consider, as MOSIX is more network bound than your typical GA.

NetType I - Single External Node

In this configuration, you basically have one single entry point to the cluster, i.e., one monitor and keyboard, and one external IP address. The rest of the cluster sits behind a normal IP-masqueraded internal network. Users are then encouraged to log in only to the main node, and spawn remote jobs via ssh.

NetType II - Multinode/Nondedicated

In this configuration, all nodes are equivalent from a network standpoint. They all have external IP addresses, and usually all have keyboards and monitors. Usually this configuration is chosen so that nodes can also double as desktop workstations, and thus have external IP addresses in their own right. If you don't need to do this, a NetType I configuration is recommended.

Classification by System Architecture

For all cluster types, I recommend x86 hardware. As shitty as it is, it really is the best price/performance ratio, especially when you're dealing with near-linearly scalable algorithms. For example, even if a DEC Alpha 21264 were 3 times faster than an AMD Athlon (which it is not), you could buy about 4 Thunderbirds for the price of one Alpha. (Trust me, I looked. Thunderbirds run as low as $900 a piece with 256MB of RAM, while a 21264 will run you at LEAST $3600, and that's if you find a half dozen stolen ones on eBay.)

ArchType I - Local Synchronized disk

In this configuration, all nodes have local disks. In addition, these disks are kept in sync nightly by an rsync job that updates pretty much everything except /var, /tmp, and /etc/sysconfig. In this configuration, extra scratch space and /home can optionally be NFS mounted across all nodes.

ArchType II - Local Nonsynchronized disk

In this configuration, all nodes have local disks, but they are not kept in sync. This is most useful for disk-independent embarrassingly parallel setups that merely do number crunching and no disk-based synchronization.

ArchType III - NFS root

This configuration is most useful for those who wish to save money on disks for all nodes, and wish to avoid the headaches of having to keep a few dozen to a few hundred disks in sync. This option is actually quite a reasonable choice, especially for programs that need some disk syncronization, but aren't otherwise disk-bound.

Classification by Synchronization Software

Now this is the important part. Choosing your software essentially determines the use of the cluster. For example, you don't want to go full-fledged MOSIX if all you are doing is setting up a rendering farm or a GA.

SoftType I - Batch System

Basically, a batch system is one where you just send it a job, and it does it. Usually only one job is run at a time, and job scheduling is left to the programmer/job runner. Sometimes a queue, or at least a launching script, is provided through some simple bash scripts; otherwise remote processes are launched manually one at a time on each node, or through ssh. Needless to say, this is the easiest software type to set up, and it also has the least overhead. It is the Recommended Buy (tm) for embarrassingly parallel apps.

SoftType II - Preemptive Scheduler/Migrator

This next class contains systems that automatically schedule and migrate processes based on cluster status. This kind of setup is really geared more towards those who just want to set up a cluster as a general mass-login system, rather than those who want to do distributed programming. The two software packages that provide this ability are MOSIX and Condor. Condor isn't officially Open Source yet, and MOSIX seems far more full featured for this purpose. In fact, MOSIX even allows you to build a NetType II/ArchType II style cluster and still use each node for cluster jobs. In addition, Condor places a lot of limitations on the types of jobs that can be run across the entire cluster, whereas MOSIX is meant to be entirely transparent. Plus MOSIX is Open Source :) Technically neither of these technologies is part of the traditional "Beowulf" scheme, but they are significant enough that they must not go without mention in any Beowulf document.

SoftType III - Fine Grained Control

The last Beowulf software implementation class actually has some overlap with SoftType I. This is the Fine Grained Control class, where individual programs themselves control the synchronization, load balancing, etc. Oftentimes these jobs are launched through a SoftType I method, and then synchronized using one of the standardized source-level libraries already available. These include the industry standards MPI and PVM. Also available are an extension to SysV IPC, and a more control-oriented remote-forking system called bproc. Of course, to utilize this method, you must (re)write your software to use one of these libraries.

Building the System

As I have said, I've built two clusters so far. The NCSA cluster was a NetType I, ArchType I, SoftType I 8 node, 12 processor cluster, which I built with some help. The Illigal cluster is a NetType I, ArchType III, SoftType I/III. Unfortunately, the NCSA reinstalled the cluster (with NT, ugh.. Is that place ever going down the tubes :() and deleted all our scripts after I left, so the best I can do is talk about what we did there. I think I like the Illigal setup better anyway. It's cheaper and feels cleaner (plus synchronization is automatic).

Some assumptions: I based both my clusters on Red Hat Linux 6.2. Red Hat takes a lot of flak from people, but of all the distributions I've seen, they actually have the cleanest setup. In addition, you will find it very helpful to use a secondary package manager such as encap. These programs allow you to package third-party programs (such as scripts you write) into their own directory in /usr/local/encap, and then maintain the symbolic links into /usr/local. This really comes in handy when you do maintenance.

Building the NetType

This is pretty straightforward. The only method that needs any explaining is NetType I. Read up on IP masquerading, and have a look at my ipmasq.init script. Essentially, the script determines the kernel version and runs the appropriate commands for either kernel 2.4 or 2.2.

The ipmasq.init script is run only on the world node. The eth0 interface is configured as normal with the external IP, and the script then puts the world node on both the external and internal (10.x.x.x) network segments. The subnodes are given IPs on the internal network. The method for doing this, however, depends on the ArchType.
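My actual script isn't reproduced here, so the following is only a rough sketch of what such an ipmasq.init might do, using an eth0 alias for the internal address; the interface names, addresses, and exact rules are assumptions on my part.

    #!/bin/sh
    # Rough sketch of an ipmasq.init for the world node (assumed layout:
    # external IP already on eth0, internal cluster network 10.0.0.0/8).
    echo 1 > /proc/sys/net/ipv4/ip_forward          # turn on forwarding
    ifconfig eth0:1 10.0.0.1 netmask 255.0.0.0 up   # alias onto the internal net

    case `uname -r` in
      2.2.*)   # kernel 2.2 uses ipchains masquerading
        ipchains -P forward DENY
        ipchains -A forward -s 10.0.0.0/8 -j MASQ
        ;;
      2.4.*)   # kernel 2.4 uses iptables/netfilter NAT
        iptables -t nat -A POSTROUTING -s 10.0.0.0/8 -o eth0 -j MASQUERADE
        ;;
    esac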

Building the ArchType

ArchType I - Local Synchronized disk

Achieving locally synchronized disks is a bit of a pain. We did it, though, but unfortunately, like I said, our work was deleted.

To install, essentially we set up 8 floppies, each with a mini root filesystem containing a kernel and bash, and replaced /sbin/init with a shell script that fetched several tarballs comprising a complete RH 6.2 system (one for each major directory hierarchy), and allowed you to tune some parameters (or use preset values from a saved config file). Essentially, we did something like the existing automated-install tools (except we wrote it ourselves, so it was cooler ;)

We then NFS exported /home from the world node, as well as each machine's /scratch partition. Each machine had all the other machines' /scratch partitions mounted at /scratch/<machinename>, and its own scratch appeared at /scratch/<its own machinename> as well.

rsync was run nightly from cron to keep the non-NFS stuff synced up, so you only had to install an RPM on the root node, and it would propagate to the subnodes by nighttime.
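The original cron script was lost along with the rest of the NCSA work, but a minimal nightly job along these lines (dropped into /etc/cron.daily/) would accomplish the same thing; the node names, node count, and exact exclude list are assumptions.

    #!/bin/sh
    # Push the root node's filesystem out to each subnode over ssh,
    # leaving the node-local trees alone.
    for i in 2 3 4 5 6 7 8; do
        rsync -a --delete -e ssh \
            --exclude /proc --exclude /var --exclude /tmp \
            --exclude /etc/sysconfig --exclude /scratch --exclude /home \
            / node$i:/
    done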

ArchType II - Local Nonsynchronized disk

This configuration is easy to set up, of course. Just install the machines however you want, and don't worry about it! :)

ArchType III - NFS root

This is the configuration used on the Illigal cluster. Essentially, the way I recommend setting things up is to use dhcp, and then either an NFS boot floppy or, if you're lucky, PXE to boot off of an Intel EtherExpress Pro 10/100+.

This actually caused me quite a bit of grief. Floppies proved so unreliable as to stop booting after one or two boots, and PXE documentation is sparse and incomplete. Eventually, I settled on Syslinux. Syslinux contains a package called PXELinux that you use in conjunction with Intel EtherExpress Pro/100+ ethernet cards to do remote booting.

To set up a boot floppy, you basically create a mini root floppy like the ones used for ArchType I, except you pass some special nfsroot arguments to the kernel (see the sketch below). You also have to mknod <nfs,boot255> c 0 255 for lilo not to complain, as well as cp -a those other couple of devices from /dev. You can then install lilo on this floppy with lilo -r /mnt/floppy.
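Here is a rough sketch of how such a floppy might be put together; the kernel version, device names, and nfsroot path are assumptions, the %s token is expanded by the kernel into the client's IP address, and ip=dhcp requires a kernel with DHCP autoconfiguration (use ip=bootp otherwise).

    # Build an NFS root boot floppy (sketch)
    mke2fs /dev/fd0
    mount /dev/fd0 /mnt/floppy
    mkdir -p /mnt/floppy/dev /mnt/floppy/boot /mnt/floppy/etc
    mknod /mnt/floppy/dev/nfs c 0 255            # dummy root device for lilo
    cp -a /dev/fd0 /dev/null /mnt/floppy/dev/    # devices lilo needs in its chroot
    cp /boot/vmlinuz-2.2.17 /mnt/floppy/boot/vmlinuz
    cp /boot/boot.b /mnt/floppy/boot/

    # /mnt/floppy/etc/lilo.conf should contain something like:
    #   boot=/dev/fd0
    #   map=/boot/map
    #   install=/boot/boot.b
    #   image=/boot/vmlinuz
    #       label=nfsroot
    #       root=/dev/nfs
    #       append="nfsroot=10.0.0.1:/export/%s ip=dhcp"

    lilo -r /mnt/floppy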

Booting off a PXE-capable bootrom is a bit harder, but much more reliable. Essentially, if you follow the documentation, you should be all right. You're going to need to fetch tftpd-hpa, because the default tftpd supplied with Red Hat Linux 6.2 does not support tsize. Installing tftpd-hpa is straightforward: just replace in.tftpd in /etc/inetd.conf with the full path to the new tftpd. I would also recommend changing "wait" to "wait.600" or so, because when all the machines boot at once they will trigger inetd's flood protection and shut down the tftp service.
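For reference, the resulting line in /etc/inetd.conf might look something like the following (the install path for tftpd-hpa is an assumption):

    # serve /tftpboot with tftpd-hpa; wait.600 raises the per-minute rate limit
    tftp  dgram  udp  wait.600  root  /usr/sbin/in.tftpd  in.tftpd -s /tftpboot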

After this is done, you're going to want to write a dhcpd.conf and a PXELinux config file. The PXELinux documentation is a little unclear on how to set up tftp serving. Basically, you copy pxelinux.bin into /tftpboot, and then put your PXELinux config file in /tftpboot/pxelinux.cfg/. You can have a separate config file for each IP, but unless you have a heterogeneous cluster, there isn't much point. However, I would recommend hard coding the dhcpd.conf file like I did, because if you have any other machines on the network that try to dhcp, they could steal your cluster IPs, and that's no good. The best way to start is to dynamically allocate IPs from the subnet, and then once you get every machine's ethernet address from arp, just cut and paste them into dhcpd.conf.
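As an illustration (the MAC addresses, IPs, and file locations here are made up, and the syntax is that of the ISC dhcpd shipped with RH 6.2), a hard coded dhcpd.conf and a shared PXELinux config might look like this:

    # /etc/dhcpd.conf
    subnet 10.0.0.0 netmask 255.0.0.0 {
        # range 10.0.0.100 10.0.0.200;   # only enable while harvesting MACs
        next-server 10.0.0.1;            # the tftp/world node
        filename "pxelinux.bin";

        host node2 { hardware ethernet 00:90:27:aa:bb:02; fixed-address 10.0.0.2; }
        host node3 { hardware ethernet 00:90:27:aa:bb:03; fixed-address 10.0.0.3; }
    }

    # /tftpboot/pxelinux.cfg/default (one shared config for a homogeneous cluster)
    default linux
    label linux
        kernel vmlinuz
        append root=/dev/nfs nfsroot=10.0.0.1:/export/%s ip=dhcp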

Of course, neither floppy nor PXE will work properly unless you configure your kernel to support these options. The main options you need for NFS root are kernel level IP autoconfiguration (under Networking options), NFS filesystem support under Filesystems (in-kernel, not as a module), and root filesystem on NFS (which will appear under NFS filesystem only if kernel autoconfiguration is chosen). In addition, if you are going to use PXE-style booting, I have noticed that kernels < 2.2.17 seem to give CRC errors. Either comment out the CRC check in ./lib/inflate.c:gunzip(), or get 2.2.17.

As you can see from the kernel append options, the disk attempts to NFS mount /export/<IP address> off of 10.0.0.1. Technically, you can have all machines mount a single root directory in /export. I did try this, and things seemed to work. But be warned: locking in /var will become unreliable, as will shutting down and certain other system scripts, since Red Hat keeps a list of running subsystems in /var/lock/subsys. So if you have multiple systems adding and deleting files in that directory, well, it ain't good. The way to create those per-node root filesystem mirrors is to mount -o remount -r / and /var, and then dd if=/dev/<root partition> of=rootdev, and likewise for vardev. You can then mount -o loop these files, and cp -a the mounted directories to all the IPs in /export/. This is the easiest way I can think of to copy only / and /var.
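In concrete terms, the procedure looks roughly like this; the partition names, the image location (which must sit on a partition that stays read-write), and the node count are all assumptions.

    # Image / and /var while they are quiesced
    mount -o remount,ro /var
    mount -o remount,ro /
    dd if=/dev/hda1 of=/export/rootdev     # root partition image
    dd if=/dev/hda6 of=/export/vardev      # /var partition image
    mount -o remount,rw /
    mount -o remount,rw /var

    # Loop-mount the images and copy them into each node's export
    mkdir -p /mnt/rootimg /mnt/varimg
    mount -o loop /export/rootdev /mnt/rootimg
    mount -o loop /export/vardev  /mnt/varimg
    for i in `seq 2 15`; do
        mkdir -p /export/10.0.0.$i
        cp -a /mnt/rootimg/* /export/10.0.0.$i/
        cp -a /mnt/varimg/*  /export/10.0.0.$i/var/
    done
    umount /mnt/rootimg /mnt/varimg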

Don't bother trying to swap over NFS. It's slow, insecure, and unstable, and there are a lot of kernel layers to go through (i.e., it takes much more memory to NFS swap than to swap normally. And guess when you swap..). Simply leave diskless nodes swapless.

Now there's just one more step to actually get these nodes to boot. You have to change some things in the standard RH 6.2 init scripts to work with NFS root filesystems. You need to make a few changes to the init scripts, change the order of the startup scripts slightly so that the networking and RPC services are left up, and, finally, modify the shutdown script so that it does not kill off RPC in the killall phase and can successfully unmount the NFS filesystems.

Additionally, you will want to modify each of the /export/<IP address>/etc/sysconfig/network files to have the correct hostname for that export's IP. Your final fstab file for each node should look something like the sketch below.
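(The mounts and addresses here are just an illustration of the idea, based on the layout described above; adjust to your own partitioning.)

    # /export/10.0.0.2/etc/fstab
    10.0.0.1:/export/10.0.0.2   /          nfs      defaults         0 0
    10.0.0.1:/home              /home      nfs      defaults         0 0
    none                        /proc      proc     defaults         0 0
    none                        /dev/pts   devpts   gid=5,mode=620   0 0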

Building the SoftType

I have only set up SoftType I clusters. If you have any experience with MOSIX, Condor, MPI, or PVM, please mail the completed sections (in SGML if possible) to me (mikepery@fscked.org).

SoftType I - Batch System

In order to set up a batch system, you first need some method of farming out processes to all nodes in the cluster. Most books mention using rsh. If you use rsh on your cluster, I will personally hunt you down and skin you alive. With the death of the RSA patent and the availability of free ssh implementations, there is no reason why you shouldn't use ssh to run remote jobs, other than pure laziness. Setting up ssh to do passwordless host-based authentication requires 3 steps. First, you must edit your sshd_config file to set RhostsRSAAuthentication to yes. Then, you have to take each host's key from ssh_host_key.pub, prefix it with the hostname,IP, and put it into ssh_known_hosts. Finally, you must add each hostname to /etc/shosts.equiv, and optionally to /root/.shosts if you want passwordless root login and launch. Making /root/.shosts a symlink to /etc/shosts.equiv should work for all implementations.
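A rough sketch of those three steps follows, assuming OpenSSH with its config files under /etc/ssh and subnodes named node2 through node15; adjust the paths and hostnames to your setup.

    # 1. In /etc/ssh/sshd_config on every node:
    #      RhostsRSAAuthentication yes

    # 2. Build a shared ssh_known_hosts from each node's public host key.
    #    (This assumes you can still log in with a password at this point.)
    for i in `seq 2 15`; do
        ip=10.0.0.$i
        key=`ssh node$i cat /etc/ssh/ssh_host_key.pub`
        echo "node$i,$ip $key" >> /etc/ssh/ssh_known_hosts
    done

    # 3. List every node in shosts.equiv, and optionally share it with root.
    for i in `seq 2 15`; do echo node$i >> /etc/shosts.equiv; done
    ln -s /etc/shosts.equiv /root/.shosts

    # Then copy ssh_known_hosts and shosts.equiv out to every node.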

From this point, it is relatively easy to write shell scripts that will launch a specific command on every node, give the status of the cluster, or shut it down. On Illigal, I wrote a script that shuts down all nodes of the cluster but the main one, allnodes, which executes a command on all nodes of the cluster, and a script that just runs uptime on each node.
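My original scripts aren't included here, but a minimal version of something like allnodes is only a few lines; the node naming scheme and count are assumptions.

    #!/bin/sh
    # allnodes - run the given command on every subnode via ssh
    for i in `seq 2 15`; do
        ssh -n node$i "$@"
    done

The cluster-shutdown and uptime scripts are then just this same loop with a fixed command.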

SoftType II - Preemptive Scheduler/Migrator

Again, I don't have any experience with setting up this type of cluster, other than to point you at the MOSIX and Condor web sites. If you end up building this type of cluster and would like to contribute a blurb describing the pitfalls and gotchas of such a setup, please contact me.

SoftType III - Fine Grained Control

Likewise with the fine grained control methods. I actually did try out bproc on the NCSA cluster, but I can't remember what was involved. I do know that bproc is meant only as a method of remote forking (from within C programs). It does provide a unified process space, but does not do load balancing, locking, or synchronization. So in my opinion, ssh probably supersedes bproc's features (although not necessarily in convenience) at the moment. In addition, bproc hasn't had a version update since I installed that cluster, which was over a year ago.

Other Notes

In addition to the architectures described above, you may find it handy to install some method of password synchronization among all the nodes, such as Kerberos or NIS. I installed NIS on the Illigal cluster. One stumbling block I came across was that I had to use 10.0.0.1 as the address of the domain server in the yp.conf files of the subnodes, since for some reason it wouldn't use /etc/hosts to resolve the node hostname.
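Concretely, the yp.conf on each subnode ends up with a line like the following (the NIS domain name here is just a placeholder):

    # /etc/yp.conf
    domain beowulf server 10.0.0.1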

Another method is to mount the root node's /etc directory on each node as well, and then simply symlink the passwd, shadow, and group files into /etc. This might be a bad idea though, because tools might check whether those files are symlinks for security reasons.

System Maintenance

Planning for the Future

System maintenance is a very important part of cluster design that most books leave out. When you build your cluster, as a rule of thumb, don't do anything "by hand" to any of the nodes when setting them up. Following this rule will force you to do things in a formal, scripted, or packaged manner, thus ensuring easy maintenance.

Always Use Package Managers

As I said before, I recommend installing every non-distribution package via RPM or encap, including scripts you write. This will make it easy to handle multiple versions of programs, and allow you to rapidly "roll back" if something goes wrong.

In an ArchType I system, you just install the RPM or encap, and then wait a day for rsync to take care of things. In an ArchType III system like Illigal, I recommend putting the RPM somewhere exported to all nodes of the cluster, like /usr/src/redhat/RPMS/i386/, and then running "allnodes rpm -Uvh /usr/src/redhat/RPMS/i386/rpmname.rpm". Technically you are reinstalling most of the same files over and over again, but you do need to update the rpmdb on each node, and you also want to make sure that files in /lib and /bin get updated too.

Adding a Node

Adding an ArchType I (Synchronized disk) node

FIXME: TODO

Adding an ArchType III (NFS root) node

To add an NFS root node, assuming that the main node is already all prepared, your changes should be minimal, but tedious.

First, you need to give the machine a root directory to mount in /export/<IP address>. Just copying an existing directory will be fine; just be sure to modify the etc/fstab and etc/sysconfig/network inside it to have the proper hostname and IP entries for the new node. Also be sure to add an entry to /etc/exports on the root node for this new directory.
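A quick sketch of that step, using 10.0.0.17 as a hypothetical new node and 10.0.0.2 as the donor directory:

    cp -a /export/10.0.0.2 /export/10.0.0.17
    vi /export/10.0.0.17/etc/fstab /export/10.0.0.17/etc/sysconfig/network
    echo '/export/10.0.0.17  10.0.0.17(rw,no_root_squash)' >> /etc/exports
    exportfs -a                  # or restart the nfs init script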

Next, you need to determine its ethernet hardware address and give it a fixed IP address in dhcpd.conf. (One way to determine the hardware address is to first uncomment that "range" section so the machine can get an IP dynamically, then boot it and run ifconfig.)

Finally, you have to tell all the other machines about the node. This is the tedious part. You have to add the hostname of the new node to the shosts.equiv file in each export directory, as well as to the main node's /etc/shosts.equiv. You have to do the same for the etc/hosts files.

Other Notes

On the Illigal cluster, I found it handy to keep the kernel config files of the nodes in /usr/src/MachineType.kconfig. This makes upgrading the kernels easier, since you just copy the appropriate config file to .config, and run make oldconfig.
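For example (the kernel version and paths are illustrative):

    cp /usr/src/MachineType.kconfig /usr/src/linux/.config
    cd /usr/src/linux
    make oldconfig                  # only asks about options new to this kernel
    make dep bzImage modules modules_install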

In addition, I also wrote a script called allexports, which is basically a simple while loop that iterates over each of the NFS export directories, setting a vol variable to the current export directory. So to copy all the kernel modules into each directory, you would do allexports cp -a /lib/modules/2.2.17 \$vol/lib/modules/. You have to escape that $, or the shell will substitute $vol before the command runs. It also comes in handy if you need to edit a file on each of the nodes, or even launch another script (in which case you don't have to escape the $ from inside the script). $i is exported as well, and is simply the node number, 2 through 15.
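A stripped-down version of allexports might look like this (the export layout and node range match the cluster described above; treat it as a sketch rather than my original script):

    #!/bin/sh
    # allexports - run the given command once per NFS export directory,
    # with $vol set to the export directory and $i to the node number.
    i=2
    while [ $i -le 15 ]; do
        vol=/export/10.0.0.$i
        export vol i
        eval "$@"               # expands any escaped \$vol at run time
        i=`expr $i + 1`
    done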

Other Sources of Info

Where to Find this Document

The latest version of this document is available on my web site. It is written in SGML.

Books

I've read most of one of the Beowulf books out there, and I was left mostly unsatisfied. My goal in writing this paper was to do a better job. If you still want to buy that book, be my guest.

URLs

If most of this document was over your head, perhaps you should consult some introductory Linux and networking documentation first.

For more info on Beowulfs, consult the usual Beowulf web sites and HOWTOs.

For info on parallel processing in general, consult the parallel processing documentation available for Linux.