System maintenance is a very important part of cluster design that most books leave out. When you build your cluster, as a rule of thumb, don't do anything "by hand" to any of the nodes when setting them up. Following this rule will usually push you to do things in a formal, scripted or packaged manner, which keeps maintenance easy.
As I said before, I recommend installing every non-distribution package via encap, including scripts you write. This will make it easy to handle multiple versions of programs, and allow you to rapidly "roll back" if something goes wrong.
In an ArchType I system, you just install the RPM or encap, and then wait a day for rsync to take care of things. In an ArchType II system like Illigal, I recommend putting the RPM somewhere that is exported to all nodes of the cluster, like /usr/src/redhat/RPMS/i386/, and then running "allnodes rpm -Uvh /usr/src/redhat/RPMS/i386/rpmname.rpm". Technically you are reinstalling most of the same files over and over again, but you do need to update the rpmdb on each node, and you also want to make sure that files in /lib and /bin get updated too.
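For readers who have no allnodes script at hand, here is a minimal sketch of what such a helper could look like. It is a hypothetical reconstruction, not the script used on Illigal: the NODES list, the node names, and the RSH variable are all assumptions, and it presumes passwordless ssh (for example via shosts.equiv) is already working.

```shell
#!/bin/sh
# Hypothetical allnodes-style helper (not the original script).
# NODES: space-separated node names; these names are made up.
# RSH:   remote shell command, assumed to be ssh with passwordless login.
NODES="${NODES:-node2 node3}"
RSH="${RSH:-ssh}"

allnodes() {
    status=0
    for node in $NODES; do
        echo "== $node =="
        # Run the given command on the node; remember any failure
        # but keep going so one dead node does not stop the rest.
        $RSH "$node" "$@" || status=1
    done
    return "$status"
}
```

With this loaded, the upgrade from the text would be run as "allnodes rpm -Uvh /usr/src/redhat/RPMS/i386/rpmname.rpm". The return status is non-zero if the command failed on any node.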
FIXME: TODO
To add an NFS root node, assuming that the main node is already prepared, the changes you need to make are minimal, but tedious.
First, you need to give the machine a root directory to mount in /export/<IP address>. Copying an existing directory is fine. Just be sure to modify etc/fstab and etc/sysconfig/network in the copy so they have the proper hostname and IP entries for the new node. Also add an entry to /etc/exports on the root node for this new directory.
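The copy step above can be sketched as a small shell function. This is only an illustration under stated assumptions: clone_export is a made-up name, the HOSTNAME= line is the Red Hat style format, and the export options printed at the end are a guess you should adapt. It deliberately does not touch etc/fstab, since the right edit there depends on how your mounts are written.

```shell
#!/bin/sh
# Hypothetical helper: clone one node's NFS root for a new node.
# EXPORT_ROOT is assumed to hold one directory per node IP.
EXPORT_ROOT="${EXPORT_ROOT:-/export}"

clone_export() {
    src="$EXPORT_ROOT/$1"    # existing node's root directory
    dst="$EXPORT_ROOT/$2"    # new node's root directory
    new_name="$3"            # new node's hostname

    [ -d "$src" ] || { echo "missing source: $src" >&2; return 1; }
    [ ! -e "$dst" ] || { echo "already exists: $dst" >&2; return 1; }

    cp -a "$src" "$dst" || return 1

    # Set the hostname in the copy (assumes a HOSTNAME= line exists).
    net="$dst/etc/sysconfig/network"
    if [ -f "$net" ]; then
        sed "s/^HOSTNAME=.*/HOSTNAME=$new_name/" "$net" > "$net.tmp" &&
            cat "$net.tmp" > "$net" && rm -f "$net.tmp"
    fi

    # etc/fstab still refers to the old node; that edit is left to you.
    echo "review by hand: $dst/etc/fstab" >&2

    # Print a candidate /etc/exports line (options are an assumption).
    echo "$dst $2(rw,no_root_squash)"
}
```

Called as "clone_export <old IP> <new IP> <new hostname>", it prints a line you can review and append to /etc/exports on the root node.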
Next, you need to determine its ethernet hardware address and give it an IP address in dhcpd.conf. (One way to determine the hardware address is to uncomment the "range" section so that the machine can get an IP dynamically, then boot it and run ifconfig.)
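As an illustration of the dhcpd.conf entry, the sketch below prints a host stanza in the usual ISC dhcpd form (a "hardware ethernet" line plus a "fixed-address" line). The function name and the sample values in the usage note are made up; substitute your own hostname, hardware address, and IP.

```shell
#!/bin/sh
# Hypothetical helper: print a dhcpd.conf host stanza.
# Arguments: hostname, ethernet hardware address, fixed IP address.
dhcp_host_entry() {
    printf 'host %s {\n    hardware ethernet %s;\n    fixed-address %s;\n}\n' \
        "$1" "$2" "$3"
}
```

For example, "dhcp_host_entry node16 00:A0:C9:12:34:56 192.168.1.16" prints a stanza that you can review and then paste into your dhcpd.conf. Remember to comment the "range" section out again and restart dhcpd afterwards.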
Finally, you have to tell all the other machines about the node. This is the tedious part. You have to add the hostname of the new node to the etc/shosts.equiv file in each export directory, as well as to the main node's /etc/shosts.equiv. You have to do the same for the etc/hosts files.
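The tedious part lends itself to a loop. The following is a minimal sketch, assuming every export directory lives directly under /export and has its own etc directory; register_node, append_once, and the MAIN_ETC variable are names invented for this example. It only appends a line when an identical line is not already present, so it is safe to run twice.

```shell
#!/bin/sh
# Hypothetical helper: announce a new node to the main node and to
# every NFS export directory.
EXPORT_ROOT="${EXPORT_ROOT:-/export}"
MAIN_ETC="${MAIN_ETC:-/etc}"

append_once() {
    # Append line $1 to file $2 unless that exact line already exists.
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

register_node() {
    name="$1"
    ip="$2"
    for etc in "$MAIN_ETC" "$EXPORT_ROOT"/*/etc; do
        [ -d "$etc" ] || continue
        append_once "$name" "$etc/shosts.equiv"
        append_once "$ip $name" "$etc/hosts"
    done
}
```

Run as root on the main node, for instance "register_node node16 192.168.1.16" (sample values). The duplicate check compares whole lines only, so an existing hosts entry written with different spacing or aliases will not be detected.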
On the Illigal cluster, I found it handy to keep the kernel config files of the nodes in /usr/src/MachineType.kconfig. This makes upgrading the kernels easier, since you just copy the appropriate config file to .config, and run make oldconfig.
In addition, I also wrote a script called allexports, which is basically a simple while loop that iterates over the NFS export directories and sets a vol variable to each one in turn. So to do something like copy all the kernel modules into each directory, you would run "allexports cp -a /lib/modules/2.2.17 \$vol/lib/modules/". You have to escape that $, or the shell will substitute $vol before the command runs. It also comes in handy if you need to edit a file on each of the nodes, or even launch another script (in which case you don't have to escape the $ inside the script). $i is exported as well, and is simply the node number, from 2 to 15.
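The allexports script itself is not reproduced here, so the following is only a guess at its shape, written as a shell function. The network prefix 192.168.1 and the way vol is built from it are assumptions; the while loop, the exported vol and i variables, and the 2 to 15 range follow the description above.

```shell
#!/bin/sh
# Hypothetical reconstruction of an allexports-style helper.
# Assumes export directories are named $EXPORT_ROOT/$NET_PREFIX.<number>.
EXPORT_ROOT="${EXPORT_ROOT:-/export}"
NET_PREFIX="${NET_PREFIX:-192.168.1}"   # assumed network prefix
FIRST_NODE="${FIRST_NODE:-2}"
LAST_NODE="${LAST_NODE:-15}"

allexports() {
    i="$FIRST_NODE"
    while [ "$i" -le "$LAST_NODE" ]; do
        vol="$EXPORT_ROOT/$NET_PREFIX.$i"
        # Export both so that scripts launched from here can see them.
        export vol i
        if [ -d "$vol" ]; then
            # eval expands an escaped \$vol (or \$i) now, inside the
            # loop, instead of in the caller's shell.
            eval "$*"
        fi
        i=$((i + 1))
    done
}
```

The eval is what makes the escaping rule work: the caller's shell passes the literal text $vol through, and it is only expanded once vol has been set for the current directory. Because the arguments are joined and evaluated again, file names containing spaces or shell metacharacters need extra quoting.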