As I have said, I've built two clusters so far. The NCSA cluster was a NetType I, ArchType I, SoftType I cluster with 8 nodes and 12 processors, which I actually built with the help of Kristopher Wuollett. The Illigal cluster is a NetType I, ArchType III, SoftType I/III. Unfortunately, after I left, the NCSA reinstalled the cluster (with NT, ugh.. is that place ever going down the tubes :() and deleted all our scripts, so the best I can do is talk about what we did there. I think I like the Illigal setup better anyway: it's cheaper and feels cleaner (plus synchronization is automatic).
Some assumptions: I based both my clusters on RedHat Linux 6.2. RedHat takes a lot of flak from people, but of all the distributions I've seen, it actually has the cleanest setup. In addition, you will find it very helpful to use a secondary package manager, such as Encap or GNU Stow. These programs let you install third-party programs (such as scripts you write) into their own directories under /usr/local/encap, and then maintain the symbolic links into /usr/local. This really comes in handy when you do maintenance.
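For example, with GNU Stow the workflow looks roughly like this (the package name and the /usr/local/stow directory are just examples; Encap's epkg works along the same lines):

    # install the program into its own tree
    ./configure --prefix=/usr/local/stow/mytool-1.0
    make && make install

    # then let stow maintain the symlinks into /usr/local
    cd /usr/local/stow
    stow mytool-1.0      # create the links
    stow -D mytool-1.0   # remove them again (e.g., before an upgrade)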
This is pretty straightforward. The only method that needs any explaining is NetType I. Read the IP Masquerading HOWTO, and have a look at my scripts. Essentially, I determine the kernel version in ipmasq.init and run the appropriate commands for either kernel 2.4 or 2.2.
The ipmasq.init is run only on the world node. The eth0 interface is configured as normal with the external IP, and IP aliasing is then used in the scripts to put the world node on both the external and internal (10.x.x.x) network segments. The subnodes are given IPs on the internal network. The method for doing this, however, depends on the ArchType.
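A minimal sketch in the spirit of ipmasq.init (the interface names, subnet, and alias address are assumptions):

    #!/bin/sh
    # put the world node on the internal segment too, via an IP alias
    ifconfig eth0:0 10.0.0.1 netmask 255.255.255.0 up

    # turn on forwarding, then masquerade the internal net out eth0
    echo 1 > /proc/sys/net/ipv4/ip_forward
    case `uname -r` in
        2.2.*) ipchains -A forward -s 10.0.0.0/24 -j MASQ ;;
        2.4.*) iptables -t nat -A POSTROUTING -s 10.0.0.0/24 \
                        -o eth0 -j MASQUERADE ;;
    esac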
Achieving locally synchronized disks is a bit of a pain. We managed it, but unfortunately, as I said, our work was deleted.
To install, we essentially set up 8 floppies, each carrying a mini root filesystem with a kernel and bash, and replaced /sbin/init with a shell script that fetched several tarballs comprising a complete RH6.2 system (one for each major directory hierarchy) and let you tune some parameters (or use preset values from a saved config file). Essentially, we did something like KickStart ('cept we wrote it, so it was cooler ;)
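Since the real script was lost, this is only a loose reconstruction of what that /sbin/init replacement did; the server IP, tarball names, and target partition are all assumptions (including the guess that the tarballs came off an NFS export):

    #!/bin/sh
    mount -t proc proc /proc
    ifconfig eth0 10.0.0.2 netmask 255.255.255.0 up   # or read a saved config
    mkdir -p /install
    mount -t nfs 10.0.0.1:/install /install

    mke2fs /dev/hda1                # wipe and rebuild the target disk
    mount /dev/hda1 /mnt
    for t in bin boot etc lib sbin usr var; do
        tar xzf /install/$t.tar.gz -C /mnt   # one tarball per hierarchy
    done
    # ...tune parameters in /mnt/etc, install lilo, reboot...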
We then NFS-exported /home from the world node, as well as each machine's /scratch partition. Each machine mounted every other machine's scratch on /scratch/machinename, and its own scratch lived at /scratch/machinename as well, so paths were uniform across the cluster.
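For instance, node1's /etc/fstab entries under this scheme might have looked something like this (hostnames are examples; the world node exports /home, and each subnode exports its own scratch):

    world:/home           /home           nfs  defaults  0 0
    node2:/scratch/node2  /scratch/node2  nfs  defaults  0 0
    node3:/scratch/node3  /scratch/node3  nfs  defaults  0 0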
Rsync was run nightly from cron to keep the non-NFS stuff synced up, so you only had to install an RPM on the root node, and it would propagate to the subnodes by nighttime.
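A hypothetical /etc/crontab entry on a subnode to illustrate (the schedule, the path, and the choice of /usr are assumptions):

    # pull /usr from the world node every night at 3:30
    30 3 * * * root rsync -a --delete -e ssh world:/usr/ /usr/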
This configuration is easy to set up, of course. Just install the machines however you want, and don't worry about it! :)
This is the configuration used in the Illigal cluster. Essentially, the way I recommend setting things up is to use DHCP, and then either an NFS boot floppy or, if you're lucky, PXE to boot off of an Intel EtherExpress Pro 10/100+.
This actually caused me quite a bit of grief. Floppies proved so unreliable as to stop booting after one or two boots, and PXE documentation is sparse and incomplete. Eventually, I settled on using SysLinux. SysLinux contains a package called PXELinux that you use in conjunction with Intel EtherExpress Pro/100+ ethernet cards to do remote booting.
To set up a boot floppy, you basically create a mini root filesystem like you would for ArchType I, except you pass some special nfsroot arguments to the kernel, as you can see in my lilo.conf. You also have to mknod <nfs,boot255> c 0 255 so that lilo doesn't complain, as well as cp -a those other couple of devices from /dev. You can then install lilo on this floppy with lilo -r /mnt/floppy.
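For reference, a lilo.conf for such a floppy might look roughly like this; the server IP and export path match the layout described below, %s is expanded by the kernel to the client's IP address, and ip=bootp is my assumption (2.2's kernel-level autoconfiguration speaks BOOTP, which dhcpd will answer):

    boot=/dev/fd0
    prompt
    timeout=10
    image=/vmlinuz
        label=nfsboot
        root=/dev/nfs
        append="ip=bootp nfsroot=10.0.0.1:/export/%s"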
Booting off a PXE-capable bootrom is a bit harder, but much more reliable. Essentially, if you follow the PXELinux documentation file, you should be all right. You're going to need to fetch tftpd-hpa, because the default tftpd supplied with RH Linux 6.2 does not support tsize. Installing tftpd-hpa is straightforward: just replace in.tftpd in /etc/inetd.conf with the full path to the new tftpd. I would also recommend changing "wait" to "wait.600" or so, because when all the machines boot at once it will trigger inetd's flood protection and shut down the tftp service.
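The resulting /etc/inetd.conf line might look something like this (the install path is an assumption; -s chroots the server to /tftpboot):

    tftp  dgram  udp  wait.600  root  /usr/local/sbin/in.tftpd  in.tftpd -s /tftpboot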
After this is done, you're going to want to write a dhcpd.conf. The PXELinux documentation is a little unclear on how to set up tftp serving. Basically, you copy pxelinux.bin into /tftpboot, and then put your configuration files in /tftpboot/pxelinux.cfg/. You can have a separate config file for each IP, but unless you have a heterogeneous cluster, there isn't much point. However, I would recommend hard-coding the dhcpd.conf file like I did, because if you have any other machines on the network that try to DHCP, they could steal your cluster IPs, and that's no good. The best way to start is to dynamically allocate IPs from the subnet, and then, once you get every machine's ethernet address from arp, just cut and paste into dhcpd.conf.
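A hard-coded dhcpd.conf along these lines might look like the following (the subnet and MAC addresses are, of course, examples):

    subnet 10.0.0.0 netmask 255.255.255.0 {
        filename "/pxelinux.bin";
        next-server 10.0.0.1;            # the world/tftp node

        host node1 {
            hardware ethernet 00:90:27:aa:bb:01;
            fixed-address 10.0.0.2;
        }
        host node2 {
            hardware ethernet 00:90:27:aa:bb:02;
            fixed-address 10.0.0.3;
        }
        # ...one host block per subnode...
    }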
Of course, neither floppy nor PXE will work properly unless you configure your kernel to support these options. The main options you need for nfsroot are kernel-level autoconfiguration (under Networking options), NFS filesystem support under Filesystems (in-kernel, not as a module), and root filesystem on NFS (which will appear under NFS filesystem only if kernel autoconfiguration is chosen). In addition, if you are going to use PXE-style booting, I have noticed that kernels < 2.2.17 seem to give CRC errors. Either comment out the CRC check in ./lib/inflate.c:gunzip(), or get 2.2.17.
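In .config terms, that boils down to something like this on a 2.2 kernel (the BOOTP sub-option is my assumption, to pair kernel-level autoconfiguration with dhcpd):

    CONFIG_IP_PNP=y          # IP: kernel level autoconfiguration
    CONFIG_IP_PNP_BOOTP=y    # ...via BOOTP, which dhcpd will answer
    CONFIG_NFS_FS=y          # NFS filesystem (in-kernel, not a module)
    CONFIG_ROOT_NFS=y        # root filesystem on NFS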
As you can see from the kernel append options, the diskless node attempts to NFS mount /export/<IP address> off of 10.0.0.1. Technically, you can have all machines mount a single root directory in /export. I did try this, and things seemed to work. But be warned: locking in /var will become unreliable, as will shutting down and certain other system scripts, since RedHat keeps a list of running subsystems in /var/lock/subsys. So if you have multiple systems adding and deleting from that directory, well, it ain't good. The way to create those root filesystem mirrors is to mount -o remount -r / and /var, then dd if=/dev/<root partition> of=rootdev, and likewise for vardev. You can then mount -o loop these files, and cp -a the mounted directories to all the IPs in /export/. This is the easiest way I can think of to copy only / and /var.
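In script form, that procedure looks roughly like this (the partition names and node IPs are assumptions; note that the images have to be written to a filesystem that is still mounted read-write, here /scratch):

    #!/bin/sh
    mount -o remount -r /            # remount read-only so dd sees a
    mount -o remount -r /var         # consistent image
    dd if=/dev/hda1 of=/scratch/rootdev
    dd if=/dev/hda6 of=/scratch/vardev
    mount -o remount,rw /var
    mount -o remount,rw /

    mkdir -p /mnt/root /mnt/var
    mount -o loop /scratch/rootdev /mnt/root
    mount -o loop /scratch/vardev  /mnt/var
    for ip in 10.0.0.2 10.0.0.3 10.0.0.4; do
        mkdir -p /export/$ip
        cp -a /mnt/root/. /export/$ip/
        cp -a /mnt/var/.  /export/$ip/var/
    done
    umount /mnt/root /mnt/var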
DO NOT PUT SWAP ON NFS. It's slow, insecure, and unstable, and there are a lot of kernel layers to go through (i.e., it takes much more memory to swap over NFS than to swap normally; and guess when you swap..). Simply leave diskless nodes swapless.
Now there's just one more step to actually get these nodes to boot. You have to change some things in the standard RH6.2 init scripts to work with NFS root filesystems. You need to remove the bootup fsck check in rc.sysinit, and then you have to change the order of the startup scripts slightly in rc3.d and rc6.d in order to leave the networking and RPC stuff up. Finally, the halt script needs some tweaking so that it does not kill off RPC in the killall phase, and so that it successfully unmounts the NFS filesystems.
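Purely as an illustration of the reordering; the link names and numbers below are placeholders, not the actual RH6.2 values:

    cd /etc/rc.d/rc6.d
    mv K90network K99network   # take the network down as late as possible
    mv K75netfs   K98netfs     # ...and the NFS mounts just before it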
Your final fstab file should look something like the sketch below. Additionally, you will want to modify each of the /export/<IP address>/etc/sysconfig/network files to have the correct hostname for the export IP.
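Something along these lines, for a node whose export lives at /export/10.0.0.2 (the devpts line mirrors a stock RH6.2 fstab):

    10.0.0.1:/export/10.0.0.2  /         nfs     defaults        0 0
    none                       /proc     proc    defaults        0 0
    none                       /dev/pts  devpts  gid=5,mode=620  0 0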
I have only set up SoftType I clusters. If you have any experience with MOSIX, Condor, MPI, or PVM, please mail the completed sections (in SGML if possible) to me.
In order to set up a batch system, you first need some method of farming out processes to all nodes in the cluster. Most books mention using rsh. If you use rsh on your cluster, I will personally hunt you down and skin you alive. With the death of the RSA patent and the availability of OpenSSH, there is no reason why you shouldn't use ssh to run remote jobs, other than pure laziness. Setting up ssh to do passwordless host-based authentication requires three steps. First, you must edit your sshd_config file to set RhostsRSAAuthentication to yes. Then, you have to take each host's key from ssh_host_key.pub, prefix it with the hostname,IP, and put this into ssh_known_hosts. Finally, you must add each hostname to /etc/shosts.equiv, and optionally to /root/.shosts if you want passwordless root login and launch. Symlinking /root/.shosts to /etc/shosts.equiv should work for all implementations.
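A sketch of those three steps in script form (the hostnames, paths, and the key staging directory are assumptions):

    #!/bin/sh
    # Step 1 is a config edit, shown as a comment:
    #   /etc/ssh/sshd_config:  RhostsRSAAuthentication yes

    NODES="node1 node2 node3"

    # Step 2: build ssh_known_hosts -- each node's public host key,
    # prefixed with "hostname,IP". Assumes the keys were copied into
    # /tmp/keys/<node>.pub beforehand.
    for n in $NODES; do
        ip=`grep "$n" /etc/hosts | awk '{print $1}'`
        echo "$n,$ip `cat /tmp/keys/$n.pub`" >> /etc/ssh/ssh_known_hosts
    done

    # Step 3: list every node in shosts.equiv, and share the same list
    # with root via a symlink.
    for n in $NODES; do echo $n >> /etc/shosts.equiv; done
    ln -s /etc/shosts.equiv /root/.shosts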
From this point, it is relatively easy to write a couple of scripts that will launch a specific command on every node, give the status of the cluster, or shut it down. On Illigal, I wrote beodown, which shuts down all nodes of the cluster but the main one; allnodes, which executes a command on all nodes of the cluster; and status, which just runs uptime on each node.
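A minimal sketch of an allnodes-style script (the real one on Illigal is not reproduced here, and the node list is an assumption):

    #!/bin/sh
    NODES="node1 node2 node3 node4"
    for n in $NODES; do
        echo "--- $n ---"
        ssh $n "$@"
    done

status is then more or less allnodes uptime, and beodown amounts to running a halt command the same way over every node but the main one.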
Again, I don't have any experience with setting up this type of cluster, other than to point you at the Condor and MOSIX sites. If you end up building this type of cluster and would like to contribute a blurb for use here describing the pitfalls and gotchas of such a setup, please do so.
Likewise with fine-grained control methods. I actually did try out bproc on the NCSA cluster, but I can't remember what was involved. I do know that bproc is meant only as a method of remote forking (from within C programs). It does provide a unified process space, but does not do load balancing, locking, or synchronization. So in my opinion, ssh probably supersedes bproc's features (although not necessarily in convenience) at the moment. In addition, bproc hasn't had a version update since I installed that cluster, which was over a year ago.
In addition to the architectures described above, you may find it handy to install some method of password synchronization among all the nodes, such as Kerberos or NIS. I installed YP/NIS on the Illigal cluster. One stumbling block I came across was that I had to use 10.0.0.1 as the address of the domain server in the yp.conf files of the subnodes, since for some reason it wouldn't use /etc/hosts to resolve the node hostname.
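So a subnode's /etc/yp.conf ends up looking like this (the NIS domain name is an example):

    # point ypbind at the world node by IP, not hostname
    domain beowulf server 10.0.0.1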
Another method is to mount the root node's /etc directory on each node as well, and then simply symlink the passwd, shadow, and group files into /etc. This might be a bad idea, though, because tools might check whether those files are symlinks for security reasons.