Hybrid Clustering
A pair of machines that share process load, data, and services between them,
using only commodity hardware and free software.
Clustering on a Budget!
by Tom Kunz
This article first appeared in Jan 2004 at Linux Gazette.
The current release of TKCluster has gone through several improvements
since then; please see the TKCluster Home Page and Release Notes for updated
information.
Table Of Contents
- Definitions
- History
- Purpose
- Credits
- Design Philosophy
- Hardware Components
- Software
- Setting Up
- Step One: Compute Clustering
- Step Two: Data Replication
- Step Three: Fail-Over
- TKCluster Configuration
- Things Left To Do
- Conclusion
- About the Author
A few definitions are in order before starting a discussion on
clusters:
- Cluster
-
A group of one or more machines connected to one another and sharing
one or more services. The two most popular types of clusters are
generally known as computational clusters, or
high-performance computing (HPC) where processes and/or
specific computations are split up and shared among many machines, and
high-availability (HA) clusters where access to a certain
service or group of services is shared so that if any single machine
in the cluster is taken offline for any reason, it does not affect the
ability of users to use that service. Computational clusters are also
sometimes called "Beowulf clusters". Early Beowulf clusters used
tools like rsh to execute parts of a parallel computation on various
nodes. The problem with this was that no single node knew what was
happening on the other nodes, or if other nodes were heavily loaded or
not. MOSIX is a much more modern computational clustering tool
that allows processes to be moved to other systems in the
cluster and keeps track of the load on each node.
The cluster described in this document is designed to be a synthesis
of both cluster types. Both MOSIX HPC clustering and data-replicating
cluster types are merged in this cluster design.
(FOOTNOTE: In the full spectrum of HPC clustering, just having two
machines connected for sharing process load is no great achievement.
However, for small workgroup or small business loads, having double
the computing horsepower available is a side benefit. If you are
really looking for true HPC clustering in the mega- or tera-FLOP
range, the MOSIX clustering described here can be expanded to hundreds
or thousands of nodes.)
- Service Clustering
-
Service clustering is a very specific type of cluster which
refers to using multiple machines to give the illusion of "high
availability" for a particular server. One of the most popular
illustrations of this is a webserver cluster. A typical way to set
this up is to configure a number of machines identically, all
running a local copy of the webserver, and then use a "trick" with a
specialized DNS server or router which, essentially, round-robins the
connections from the outside world to a different machine (or the
least-loaded machine) on each access from the outside. A machine that
dies or is taken offline is noted by the DNS server or router and is
ignored until it shows it is ready again. This "trick" works fine
when each of the webservers is accessing files and
databases on a shared "backend" server; however, the question then
arises, "Well, what kind of cluster is that database engine living
on?" The answer is that the "backend" server is either not a cluster
at all (i.e., if it fails, the whole web cluster fails), or relies upon
some expensive, proprietary hardware and/or software. In some
respects, service clustering barely qualifies as a
true cluster because it still has a single point of failure,
the "backend" server. But it certainly
looks good for Linux distribution vendors to advertise "Linux
cluster" products without addressing the real issue of data
replication and fail-over. This marketing technique was very popular
in the late 1990's among certain Linux distribution vendors seeking to
make some quick bucks on the buzzword.
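The round-robin "trick" described under Service Clustering can be sketched in a few lines of Python. This is only a toy model of what such a DNS server or router does, not the code of any real product; the class name, method names, and the machine names web1..web3 are all made up for illustration.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy dispatcher: hands out machines in rotation, skipping dead ones."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.alive = set(self.backends)      # machines currently believed healthy
        self._rotation = cycle(self.backends)

    def mark_down(self, backend):
        # Called when a machine dies or is taken offline.
        self.alive.discard(backend)

    def mark_up(self, backend):
        # Called when a machine shows it is ready again.
        if backend in self.backends:
            self.alive.add(backend)

    def next_backend(self):
        # Look at each machine at most once per request.
        for _ in range(len(self.backends)):
            candidate = next(self._rotation)
            if candidate in self.alive:
                return candidate
        raise RuntimeError("no web servers available")


lb = RoundRobinBalancer(["web1", "web2", "web3"])
first_four = [lb.next_backend() for _ in range(4)]
```

Notice that nothing in this sketch says anything about where the data lives, which is exactly the weakness of this kind of cluster: every machine in the rotation still depends on the same "backend".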
- Device-Level Replication / Filesystem-Level Replication
-
Device-level refers to data replication occurring on a
block-device. Block writes occur on, for example, /dev/hda1, a
block device. This block device need not contain any specific data
structure or be formatted in any way. In contrast to this is
filesystem-level where files on a formatted filesystem are
being read and written, for example, to the /opt filesystem.
Device-level replication may be used for duplicating the data from
something like an Oracle or Informix database partition. Oracle,
Informix, and other database engines sometimes boast better performance
on raw devices than they would achieve on formatted filesystems.
In recent years, it seems this belief has been challenged, with
filesystem-based engines giving performance comparable to raw-device
based engines.
Filesystem-level replication means that the two replicated filesystems
are probably not identical at the block-level, however they contain
the same files. Periodic synchronization via rsync, which walks the
directory structure tree looking for changed file contents, is the
most common way to achieve filesystem-level replication. However,
this is an extremely inefficient way to attempt clustering, and is
fraught with peril. Using rsync does not work well with large files
or files with contents that change frequently. Also, rsync uses a temp
file during duplication, so should rsync be interrupted for any reason,
the file it was duplicating will be left as a "stray" on the system
(i.e., the next invocation of rsync won't realize that the temp file
the previous invocation created is actually an rsync file, and will
duplicate it over again), leaving a duplicate of a duplicate behind.
Also, because rsync must walk the entire filesystem tree specified
looking for changes, it's highly inefficient and ends up examining
files it need not worry about.
It works well for duplicating data over slow links, such as the
internet, and with long intervals between refreshes, such as once a
day. Rsync can be relied upon to synchronize online archives and
mirrors periodically, but it doesn't seem suited to clustering. If
it's set up to work as a cluster with long delays between rsync
invocations, data is lost if failure occurs between rsyncs. Also, if
rsync fails on large files (e.g., CD-ROM images) the failure leaves
behind large temporary files which need to be cleaned away by hand
after recovery. If the delay between rsyncs is shortened to be
nearly-continuous, the machine is wasting most of its time examining
files which need no examination. When large files are written to disk
on the master, continuous rsyncs separated with short delays will
cause the disk to thrash heavily until the file is written on the
master. If someone has found a way to use rsync more efficiently,
please let me know.
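To make the tree-walking cost concrete, here is a small Python sketch of a periodic sync pass. It is a simplified stand-in, not rsync's actual algorithm: it only compares modification times against the time of the previous pass, and the function names and the throwaway test tree are my own inventions for the example.

```python
import os
import tempfile


def scan_for_changes(root, since):
    """Walk every file under root; return (files examined, changed paths)."""
    examined = 0
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            examined += 1                      # every file costs a stat() call
            if os.stat(path).st_mtime > since:
                changed.append(path)
    return examined, changed


def demo():
    """Build a throwaway tree of 5 files, only one of them recently modified."""
    with tempfile.TemporaryDirectory() as root:
        for i in range(5):
            path = os.path.join(root, f"file{i}.dat")
            with open(path, "w") as fh:
                fh.write("x")
            os.utime(path, (1000, 1000))       # pretend the file is old
        os.utime(os.path.join(root, "file3.dat"), (5000, 5000))  # one fresh file
        examined, changed = scan_for_changes(root, since=2000)
        return examined, len(changed)
```

In the demo, all five files are examined to find the single one that changed. Scale that ratio up to a fileserver with hundreds of thousands of files and a pass every few seconds, and you can see where the machine's time goes.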
There were a small number of proprietary vendors in the mid to late
1990's who advertised commercial systems which performed either
device-level or filesystem-level replication. Two such products were
SavWareHA (formerly known as Sentinel) for SCO UNIX and Legato
(formerly Vinca) Co-Standby Server for Windows NT.
In ages past, there have been several attempts to build cluster
systems, with varying degrees of success and failure. Solutions from
vendors like DEC, IBM and Sun relied upon specialized, proprietary,
expensive hardware and software. Within the Linux community, there
has always been a lack of almost any kind of true cluster,
although the earliest successes in Linux clustering were
the early MPI / PVM computational clusters and
webserver clusters. Unfortunately, both MPI/PVM style clusters and
webserver clusters are "niche" products which do not fully
address all aspects of clustering (i.e., fail-over and data replication).
The expensive route to go with clustering and shared-storage and/or
data replication has been in the form of a pair of machines with an
external array of disks between them. To do this, the servers must be
in the same proximity (within a few meters of each other, typically in
the same rack or neighboring racks), and each server must have
redundant connections to the external storage array. This usually
meant using either Fibre Channel or SCSI controllers, at least two per
machine, both of which plugged directly into the external disk array.
The external array must have at least two available, independent
connections into it per server in the cluster. Shared SCSI and Fibre
Channel solutions using this technique are not uncommon, but also not
cheap. A minimal configuration would consist of two servers, two SCSI
RAID controllers in each for a total of 4 controllers, an external
JBOD case with 2 electrically-independent SCSI buses and at least 2
disks in it. This is very specialized hardware. Usually this kind of
cluster is sold by high-margin server vendors such as IBM, HP, or Sun.
The following graphic illustrates the typical layout:
These types of clusters are truly effective; they work splendidly,
but only if you have the cash to set them up in the first place. A
5-figure pricetag would be typical for an entry-level system, and a
6-figure pricetag would conceivably be the next step up in the
technology. The drawbacks to this kind of high availability were both cost and
the fact that the machines had to be located in the same physical
area. With high cost come the questions asked by those who hold the
purse strings in the organization: "How much is my data really
worth? Can I afford to be out of service for a little while until the
server is fixed? Is it worthwhile to spend 6 figures on this server?
How do I prioritize which services and data are important enough to
get a slice of this server and which are left out in the cold on a
single, non-clustered server?" By using shared SCSI, servers had to
sit very near each other in order to work, and stepwise degradation in
performance and reliability was the cost of moving just a few feet too
far. Ethernet, on the other hand, can connect two PCs relatively
cheaply, across very long distances without speed or reliability
problems. Despite the ubiquity of relatively cheap PC's, Linux, and
ethernet hardware, until recently there were no really good data
replication clusters that took advantage of the power of commodity
hardware.
The purpose of this document is to fully describe and enumerate the
steps involved in building a hybrid cluster which takes advantage of
both computational clustering and device-level, high-availability
clustering using nothing more than standard, open-source software and
commodity hardware. None of the software used in this costs a thing,
and the hardware used in this arrangement should be available for
relatively cheap prices from most computer parts stores or online
retailers. Basically, this document should allow you to take two
commodity PC's running Linux and make them think and act as one highly
reliable storage cluster that can handle a pretty hefty user load.
Also, because ethernet is being used as the interconnect instead of
SCSI, servers can be physically isolated from each other in the event
of catastrophe, and placed hundreds of yards apart, or even further
with the proper switch or repeater in place.
This cluster would not be even remotely possible without the efforts
of these people:
- Philipp Reisner: author of DRBD, the soul of the data-replication
portion of the cluster. Many thanks for his work on DRBD.
- The MOSIX Team: Headed up by Professor Amnon Barak at the Hebrew
University in Jerusalem, Israel, the MOSIX project gives ordinary
people and businesses the ability to build their own supercomputing
cluster from commodity PC's at a tiny fraction of the cost of a
proprietary system. Thanks for creating such a cool clustering technology!
- Peter Braam: Although I don't believe he wrote any of the software
directly, he pointed me in the direction of using DRBD. He is
THE source for info on real filesystem clustering. He is the
author of the Lustre filesystem from Cluster File Systems,
Inc. Lustre is still relatively young and deserves much
exploration. I heard of it at Linux World in Jan 2003 and attended
lectures by Peter Braam and his colleagues, and I would encourage
anyone else interested in clustered filesystems to take a look at his
work. Lustre powers MAJOR clusters, while my attempts here are to
describe a small, low-powered cluster I put together from other
people's works. If you need serious enterprise-capable clustering, go
take a look at Lustre and see if it is better suited for your work.
- Simon Horman: Simon is the author of heartbeat, the package which
contains some of the nifty utilities used in this cluster.
- The entire Linux kernel community: Without you all, none of this
would be possible. Your work is of the utmost importance.
In recent years, there has been a paradigm shift from the usage of
large, expensive, proprietary systems to clusters of smaller, cheaper,
open components. Industries which relied upon large systems
such as HP N-class, IBM z-Series, or Sun Enterprise servers are moving
from the concept of single, large systems (often the size of a
household refrigerator, or group of refrigerators) that handle the
workloads of numerous processes, to the concept of multi-node
clusters, say 2 nodes to 128 or even more, to share workload and
storage.
This type of shift from big and proprietary to small and open is not
new. Since the days of Thomas Edison, electrical and electronic
devices have gotten smaller, cheaper, faster, and better. The shift
in the 1980's was from mainframe IBM machines to smaller, faster DEC
VAX machines. The shift in the 1990's was from DEC VAX to UNIX-based
servers and workstations. Today, the proprietary UNIXes are giving
way to free Linux distributions, which draw from the best parts of the
UNIX tradition. All previous server-quality systems had high price
tags associated with them, and could maintain that level of pricing
because only larger companies required the help of computers to
transact business. But now, computing is essential for everyone from
the corporate megalith down to the mom-and-pop-shop corner business.
Today, Linux has emerged as the server platform of choice for many
businesses of all sizes, largely due to two fundamental properties of
Linux: price and stability.
However, no matter how robust the software and operating system, the
Second Law of Thermodynamics takes its toll on all Creation:
everything wears out and breaks down, no exceptions.
Motherboards, hard drives, CPUs, memory chips, everything has a
lifetime. In this article, I discuss the specifics of a cluster
design which minimizes the reliance on any single point of a system,
and allows users on a network to realize uninterrupted service for a
variety of applications including filesharing, web serving,
application hosting, network firewalls, database connectivity, and
almost any other service.
The hardware components used in this type of cluster are generally
cheap and easily available. The minimal hardware required to set up a
usable demonstration of this cluster is as follows:
- A pair of identical servers.
The level of "identicalness" is only a matter of convenience for the
administrator. I find that using exactly identical hardware, down to
board revision level and firmware level, makes life much easier when
administering the servers. The only thing to keep in mind is that
because you are duplicating data between partitions on two different
machines, special care must be taken to ensure that both partitions
are the same size, down to the exact number of blocks. If they are
not, it is possible to force the cluster to use the smaller-size of
the two partitions as the "actual" size. The bottom line is, yes it
is possible to have wildly disparate machines as the cluster with only
a few points of commonality, but I strongly advise you to just bite
the bullet and get identical machines with identical hardware. The
best way to do this is to build them yourself, as you can personally
inspect the hardware to ensure that the make, model number, board
revision, and firmware are the same. If you are leery about doing
this yourself, I offer reasonably priced clusters which are configured
specifically for this purpose. As this is not necessarily an ad for my
business, you may also contact me for more info if desired.
For this article, I will suggest that you have a physically separate
disk or disk array for data sharing. You can do data sharing on the
same disk as your system disk, but for the sake of this article, let's
say /dev/sda is your boot and system disk and /dev/sdb is entirely
devoted to data sharing between nodes of the cluster. In your
particular installation, you can use whatever you want. I've even
done clusters where the data lived on software RAID partitions in an
external JBOD rack shelf. What exactly you do with the shared
partition(s) will be determined by your budget and performance needs.
- Network cards, cabling, and a switch or crossover cable to connect
the two servers.
I recommend gigabit ethernet for this, although any network connection
could, technically, work. The necessity of gigabit or not depends
upon your usage and needs. Most modern servers and workstations come
with either 10/100 or gigabit already built into the motherboard. If
you want to use gigabit and don't already have it, the Intel Pro 1000 MT
cards can be picked up for somewhere in the neighborhood of $60, and
they work well in either 33 or 66 MHz PCI slots. Unless you're on the
strictest of budgets, $60 for a gigabit card is a wise purchase.
Intel also makes more expensive 66MHz PCI versions of the same gigabit
card in the $100-$120 range. You could even try out cheaper Netgear
or similar gigabit cards, which are down in the $40-$50 price range.
I have generally obtained best results with the roughly-$60 Intel Pro
1000 MT card, using Intel's own open-source drivers for it. The
drivers in the canonical Linux kernel are older versions of
the ones Intel has on their website, and drivers downloaded directly
from Intel will be newer. Last I checked, Intel had frequent
revisions and development on their gigabit driver, so you will
probably get better performance and reliability from those drivers.
The purpose of the card is to transport the data from the master to
the slave during data replication and synchronization, as well as to
shuffle process load to be shared between the machines. Something to
keep in mind is the speed of each related component in the system.
The speed of data replication and process sharing is a function of how
well-matched all the components of the entire system are. While it's
possible to build a reasonably effective cluster from PC's you pick up
from Wal-Mart or Dell for $499, you won't be reaching the full
potential of what could be done with the equipment available for not
much more cost.
The data replication process between the nodes of the cluster
exercises most of the system other than the CPU and RAM. It's
a good way to find out, really, what kind of system you have.
Manufacturers love to advertise the speed of the CPU and the RAM, but
these are the easiest things to boost the speed on, and generally
don't participate very much in the data replication process. The more
important numbers are the bus and disk speeds, which are frequently
not advertised, and usually quite misunderstood. I will outline the
roles of each here:
- CPU: The speed of the CPU will really only participate in
the data replication process if software RAID is being used. In
general, I recommend hardware RAID, although I have made several
systems with software RAID that had acceptable performance. Modern
CPUs are so much faster than the rest of the
computer that they spend little time doing real work. Even when given
full IO loads, most modern CPUs are really doing almost nothing of
what they *could* be doing.
- RAM: The speed of the RAM is also vastly faster than
anything the disk subsystem or IO bus can produce. Like the CPU
speed, RAM speed usually makes little to no difference to the rate of
data replication.
- Bus speed: This is an important factor to the overall
performance of the data replication. If within budgetary limits, a
faster PCI bus should be used to maximize the capabilities of your
cluster. This will affect the efficiency of the gigabit card as well
as the disk subsystem.
- Disk and Controller speed: This is probably the most
important piece of the equation. The bus is faster than the disks,
but not by as wide a margin as the CPU and RAM are. Maximal
throughput across the bus can only be realized with a fast disk
controller and comparably fast disks, arranged in a RAID for maximal
parallelization of IO. Some will automatically assume this means SCSI
RAID, but this is not necessarily true. Given an infinite budget,
perhaps SCSI RAID isn't a bad idea, but most of us have budgetary
limits. Entry-level ATA RAID and SATA RAID cards such as the 3Ware,
Promise, and LSI cards are incredibly cheap by comparison to most SCSI
RAID cards, and can provide performance in the same range as the
expensive SCSI RAID cards.
A note about disk performance: no matter what kind of disk you buy,
whether ATA, SATA, or SCSI, the sustained throughput of almost
any disk on the market today is limited to about 65 MB/s.
"What?" you say, "I just spent a mint on a 320MB/s SCSI RAID
controller for each cluster node!". Yup, that's right. That 320MB/s
only applies to the data transfer from the buffer on the controller
card to the cache on the disk itself. The actual transfer of data
from the cache of the disk to its physical platters is the limiting
factor. Someone please correct me if I'm wrong, but after reading
the detailed spec sheets on numerous Western
Digital, Seagate, and Hitachi/IBM drives, I believe the highest
sustained throughput rate was on the order of about 65 MB/s. The
320MB/s SCSI controller is advertising only the burst rate that is
possible until the cache of the drive fills up. Once cache memory on
the drive is full (which is usually 2MB, 8MB or sometimes even 16MB on
the best drives) the data can only be processed at the
cache-to-platter speed. However, don't fret. Using the
proper RAID configuration can basically remove this limitation. On
many controllers, using RAID 0, 1, or 0+1 will provide faster
performance than using RAID 5 because a RAID 5 configuration requires
on-the-fly checksumming routines to be processed by the RAID
controller's CPU. Software RAID-5 calculations are done in the MMX
high-speed registers of Pentium-class CPU's. RAID 0, 1, or 0+1 allows
for parallelizing of read and write operations between multiple disks.
Disk blocks can be handled in parallel between each disk, which
multiplies that 65MB/s limitation. Data is still flowing in and out
of each disk at 65MB/s, but if you have two disks striped
together, the aggregate rate can be doubled. At about 133MB/s, bus speed
becomes a factor for regular PCI. For the purpose of this paper,
we'll just assume you have a typical PCI bus, and nothing overly
"special".
If you want to push the limits on speed, it should be no surprise
that, ultimately, how fast you'll really be
able to go with the data replication within the cluster will depend on
the entire assembly of gigabit card, disk controller card, RAID
layout, PCI bus, and disks you choose. If budget limits are stopping
you from getting the very best and fastest in hardware, do not worry.
You can still have quite a nice cluster without tremendous expense
using basically commodity PC parts you should find at almost any
retailer or online parts store. If you're just building this for fun
or for a small business cluster, having even just 30MB/sec of
replication throughput is still respectable. You couldn't even get
that level of redundancy and performance from IBM, Sun, or HP without
spending about 5 figures, so do not be overly concerned if your cluster
works but isn't all that fast.
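The arithmetic behind the last few paragraphs can be sketched as a back-of-the-envelope calculation in Python. The 65 MB/s per-disk and 133 MB/s PCI figures are the ones quoted above; the 125 MB/s figure for gigabit is my own assumption (1000 Mbit/s divided by 8, ignoring protocol overhead), and the function name is made up. It also assumes striping scales perfectly and that nothing else is competing for the bus, so treat the results as ceilings, not predictions.

```python
DISK_MBPS = 65       # sustained per-disk rate quoted in the text
PCI_MBPS = 133       # regular PCI ceiling quoted in the text
GIGABIT_MBPS = 125   # assumption: 1000 Mbit/s / 8, no protocol overhead


def replication_ceiling(n_striped_disks):
    """Return (limiting component, MB/s) for an idealized replication path."""
    stages = {
        "disks": n_striped_disks * DISK_MBPS,   # assumes perfect striping
        "pci bus": PCI_MBPS,
        "gigabit link": GIGABIT_MBPS,
    }
    slowest = min(stages, key=stages.get)       # the weakest link sets the pace
    return slowest, stages[slowest]


one_disk = replication_ceiling(1)
two_disks = replication_ceiling(2)
```

With a single disk, the disk itself is the limit; with two striped disks, the limit moves to the gigabit link. On a real machine the gigabit card and the disk controller often share the same PCI bus, so expect real numbers to come in well under these ceilings.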
If you're on a limited budget, you're in luck. Not a single byte of
the software in this cluster costs anything. At least not for just
using it: if you want support from someone on a particular piece, you
have to pay, but otherwise it's all GPL'd software, with one caveat:
MOSIX's license isn't 100% GPL. If you are a GPL zealot, you will want
to check out openMosix and compare it. It has some neat features, but
this cluster is using regular MOSIX. I'm sure it could be done with
openMosix as well, but I haven't had the time to do so.
This cluster is not specific to a particular distribution. I have
tested it on Red Hat 7.3 and Red Hat Enterprise Server 3.0; however, it
should work on virtually any Linux distribution. My cluster manager
is all written in Perl, and I had to tweak some of the calls for
forking when switching from RH 7.3's Perl 5.6 to RH ES's Perl 5.8.
I'm sure a much more skilled programmer could generalize the calls in
my cluster manager to work right regardless of Perl flavor. As of
this writing, it works well with Perl 5.8. You will also need to do
some kernel configuration and recompilation, so if you're squeamish
about doing that, this is the perfect opportunity to learn it and get
comfortable with it.
Here is the list of packages you will want to download:
- MOSIX: As of
this writing, the latest version of MOSIX is 1.10.1, for kernel
2.4.22. You may wish to download the MOSIX RPM file, however I will
cover this installation from the perspective of someone wishing to do
all the compilation themselves. RPM files may or may not be available
for your particular distribution, so compiling from source gives you
the opportunity to build it and have it work regardless of the
distribution you are using. MOSIX is the part of the cluster which
does the process-load sharing, as well as being an integral part of
maintaining contact with the other machine(s) in the cluster and
keeping track of which is up and which is down.
- Linux Kernel:
You will need to make sure the version of MOSIX you downloaded above
matches the version of the kernel. The MOSIX patches are specific to
kernel version and are
always a few versions behind the latest kernel version, so make sure
you get the right one.
- DRBD: DRBD is
the piece of the cluster which will be doing the data replication. As
of this writing, the most current version of DRBD is 0.6.10. I have
built clusters with 0.6.5, 0.6.6, and 0.6.10, and DRBD just keeps
getting better. DRBD is a block device driver which sits at a layer
above the usual block devices. For example, a typical disk layout on
an IDE-based machine will have /dev/hda as the main system disk.
Partitions will be /dev/hda1, /dev/hda2, /dev/hda5, etc. Normally,
writes would occur to those partitions directly in a non-clustered
system. In a cluster with a shared partition, however, the block
device is the network block device, /dev/nb0, which is mapped to a
"lower" block device. You tell DRBD to map one of its
devices, /dev/nb0, to a physical device, say /dev/hdb1, as well as to
a remote network address. When writes occur on the master node to
/dev/nb0, DRBD writes the data to the local, physical device
/dev/hdb1, and copies the data to the specified remote network
connection, where the slave node writes it to its own local device.
DRBD is elegant and well-maintained, and commercial support contracts
are available for reasonable rates from Linbit, if desired.
- Heartbeat: Although I'm not specifically using the
heartbeat software in this cluster, heartbeat has some nice utilities
in it for IP address takeover.
- TKCluster: This is my own cluster manager. It's a
pile of Perl scripts which talk to each other within the cluster. On
the master node, it makes sure that DRBD maintains connectivity with
the slave. If the slave dies and the master node goes into a
StandAlone mode, it instructs DRBD to go back into Primary mode and
wait for connection from the slave. On the slave node, should it
detect failure of the master, it attempts to force a reboot of the
master node, resets DRBD to push the slave into master mode, and
restarts necessary services. Basically, it glues all the other above
packages into a more unified system.
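To picture the write path described in the DRBD entry above, here is a toy Python model. It is purely conceptual and is not how DRBD is implemented: the dictionaries stand in for the local and remote disks, the class and method names are hypothetical, and the "pending" bookkeeping used to catch the slave up after a disconnect is my own simplification for illustration.

```python
class ToyMirroredDevice:
    """Conceptual model: every block write lands locally and on a peer."""

    def __init__(self, peer_store=None):
        self.local_store = {}         # block number -> bytes (think /dev/hdb1)
        self.peer_store = peer_store  # dict standing in for the slave's disk
        self.pending = set()          # blocks written while the peer was away

    def write_block(self, block_no, data):
        self.local_store[block_no] = data          # local physical write
        if self.peer_store is not None:
            self.peer_store[block_no] = data       # copy to the slave
        else:
            self.pending.add(block_no)             # remember for later

    def reconnect(self, peer_store):
        # Peer is back: send it whatever it missed (simplified resync).
        self.peer_store = peer_store
        for block_no in sorted(self.pending):
            peer_store[block_no] = self.local_store[block_no]
        self.pending.clear()


slave_disk = {}
dev = ToyMirroredDevice(peer_store=slave_disk)
dev.write_block(0, b"abc")
```

The point of the model is that the application only ever talks to one device; the copy to the other machine happens underneath it, block by block, with no knowledge of files or filesystems.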
At this point, you should have a pair of servers with all the above
software downloaded. I will write the rest of this assuming you have
basically identical hardware. More experienced Linux administrators
can play around with having disparate hardware, and it's quite
possible to do that, but I won't go into all the details of
maintaining different drivers and platforms.
I will assume you have /dev/sda as your boot and system drive,
/dev/sdb1 is the partition that will be replicated between the
servers, eth0 will be your connection to your 10/100 LAN that the rest
of the office lives on, and eth1 will be your gigabit connection via
crossover cable between the two servers. Your LAN on eth0 will be
192.168.0.* with a 24-bit netmask of 255.255.255.0. The gigabit
connection will be 192.168.1.*, also with a 24-bit netmask. Let's make
it easy to remember and have server1 set up with 192.168.0.1 on eth0
and 192.168.1.1 on eth1. Likewise, server2 should be brought up with
192.168.0.2 and 192.168.1.2.
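If you want to double-check an addressing plan like this before typing it into two machines, a small Python sketch using the standard ipaddress module will do. The addresses and interface names are the ones assumed above; the PLAN layout and the check_plan function are just my own illustration.

```python
import ipaddress

LAN = ipaddress.ip_network("192.168.0.0/24")          # eth0, office LAN
REPLICATION = ipaddress.ip_network("192.168.1.0/24")  # eth1, gigabit crossover

PLAN = {
    "server1": {"eth0": "192.168.0.1", "eth1": "192.168.1.1"},
    "server2": {"eth0": "192.168.0.2", "eth1": "192.168.1.2"},
}


def check_plan(plan):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    seen = set()
    for host, ifaces in plan.items():
        for iface, net in (("eth0", LAN), ("eth1", REPLICATION)):
            addr = ipaddress.ip_address(ifaces[iface])
            if addr not in net:
                problems.append(f"{host} {iface} {addr} is outside {net}")
            if addr in seen:
                problems.append(f"{addr} is assigned twice")
            seen.add(addr)
    return problems


problems = check_plan(PLAN)
```

Keeping the replication traffic on its own subnet and its own interface means the heavy block-level traffic never touches the office LAN.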
The first thing to do is bring up your favorite Linux distribution on
what I will call "server1". Make sure you install the development
tools for your distribution, including header files for kernel
development. We will setup server1 to be the master node of the
cluster, and the partition it replicates to the slave, server2, will
be used for NFS and/or Samba filesharing. Many other types
of services, including web, email, DNS, Postgres, or MySQL,
could be set up on this partition as well. Also,
DRBD comes ready for sharing 2 devices when you compile it, however
this can be expanded to as many as 256 devices by changing the value
of minor_count in drbd/drbd_main.c of the DRBD source package.
I have used DRBD for sharing up to 16 devices between servers, and
this small change to drbd_main.c is easily done.
The next thing to do is install the kernel source, and if you're
not familiar with the process of compiling it, take some time and
compile yourself a kernel. If you are new to this, don't be afraid
to try it on your own. There's an abundant supply of documentation on
how to do this all over the web, and the details you'll need for your
particular server are outside the scope of this document. Currently,
you would want the file linux-2.4.22.tar.gz. Download this to your home
directory and untar it in /usr/src and start up the configuration
script:
# cd /usr/src
# tar xvfz ~/linux-2.4.22.tar.gz
... <tar output here>...
# cd linux-2.4.22
# make menuconfig
If you have X installed and running, you can replace "make menuconfig"
with "make xconfig" above. If you have neither X nor a modern
terminal and are stranded on a terminal which can't do cursor
positioning with the NCurses calls found in "menuconfig", you can
always fall back to "make config", which is just line-by-line
configuration of every option in the kernel. I've done it, and
it's painful; you'd much rather configure your kernel with menuconfig
or xconfig.
I have only a few "tips" for this. Make sure you get the canonical
kernel source and not just the distribution-packaged source. Although
distributions carry the full kernel source packaged up with them, I
tend to not trust those packages for use with MOSIX, for no
other reason than that the distribution-packaged kernels tend to have
extra, non-canonical source added in. This is good for the vendor and
good for the end-user because sometimes hardware which would otherwise
have been unrecognizable will work with those extra patches applied.
The vendor of the distribution has something to boast over, and the
end user gets to use their new gizmo, which would have been unusable
if plain canonical source had been used. Usually, the patches for
that hardware will make it into the canonical kernel eventually. The
problem is that MOSIX patches may not apply properly to kernel sources
that have non-canonical patches already applied. Once you have
successfully compiled and booted your own kernel a few times with
different options and have familiarized yourself with how your
bootloader (probably either GRUB or LILO) works, you're probably ready
to go further. You will probably want to install the source under
/usr/src/, as this is where MOSIX looks for source by default.
Lastly, make sure the option for "Network Block Device" is set to
be a module. You do not want it to be compiled into the kernel or the
DRBD module will not be able to acquire the /dev/nb* logical devices.
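One quick way to double-check this after configuring is to look at the
.config file the configuration tool writes. The sketch below assumes
the usual kernel symbol for the Network Block Device driver,
CONFIG_BLK_DEV_NBD, and the source path used above; the function name
nbd_is_module is made up for this example.

```shell
# nbd_is_module CONFIG: succeed only if the given kernel config file
# builds the Network Block Device driver as a module ("=m").
nbd_is_module() {
    grep -q '^CONFIG_BLK_DEV_NBD=m$' "$1"
}

# Path assumed from the untar step above; adjust it to your own tree.
cfg=/usr/src/linux-2.4.22/.config
if [ -f "$cfg" ]; then
    if nbd_is_module "$cfg"; then
        echo "NBD is set to build as a module"
    else
        echo "NBD is not a module: fix this before building"
    fi
fi
```

A value of "=y" means the driver is compiled into the kernel, which is
the case the paragraph above warns against.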
For individuals who are very new to kernel compilation, one pitfall
that catches many unaware is the initrd image. When you compile your
kernel to have lots of modules in it, you'll need to list them in
/etc/modules.conf. In particular, modules critical to booting (i.e.,
disk controllers) will have to go in there. When your new kernel boots,
it will expect a matching initrd image, which is a boot-time RAM disk
that is loaded with the drivers you need to boot up. This can be
somewhat complex, so I won't go into the long and detailed description
here, but just beware that this is a potential pitfall to
kernel-compilation-newbies. In fact, I've been compiling kernels
myself since about 1994 or 1995 and I still forget about the initrd
sometimes, only to find my kernel demanding drivers it can't find.
As an alternative to modules, which are flexible because you can
insert and remove them at will, you could build everything into the
kernel itself so you don't need modules at all. There is technically
nothing wrong with this, although it is less flexible, especially if
you ever have to add or change hardware, so I always recommend going
the route of using modules.
The next thing to do is to install MOSIX. Make sure you download both
the kernel patches and the userland utilities, because as of this
writing they are packaged separately. Currently, you would download
MOSKRN-1.10.1.tar.gz and MOSIX-1.10.1.tar.gz. Untar these and run the
installation script. Assuming you have the source tarballs in your
home directory, you will do something like this:
# tar xvfz ~/MOSKRN-1.10.1.tar.gz
... <tar output here>...
# tar xvfz ~/MOSIX-1.10.1.tar.gz
... <tar output here>...
# cd MOSIX-1.10.1
# ./install.mosix
Then answer the questions from the script. These are very simple
questions, like where the kernel source lives and whether you want to
watch the compilation happen or not. If you're unsure when it asks
you about the "MFS mount point", just say yes. It will take care of
the details for you.
MOSIX's install script will ask for your desired kernel config type
and bring it up. The only thing you'll need to worry about there is
to double-check to make sure that MOSIX process migration is turned
on. This allows processes running on one machine to transparently
migrate over to the other machine if the load on the first machine, or
home node, becomes too high. This is one of the coolest
features of MOSIX. No applications have to be recompiled, and the user
will never see a difference in the execution or debugging of a
process. It is 100% transparent to the user.
The /proc pseudo-filesystem is integral to MOSIX, so you'll also
want to make sure that is turned on. When the kernel compilation tool
returns to the install script, the script checks to make sure that the
critical pieces of MOSIX are turned on, and will ask to make sure if
it finds something amiss.
If you've already been successful at compiling your own kernel, you
should feel confident once the compilation completes that MOSIX will
work for you.
Once the compile completes, my natural temptation is to immediately
bounce the server and try to boot that new kernel. However, in this
case, I would encourage some file editing before rebooting. MOSIX
uses /etc/hosts and /etc/mosix.map to figure out who it is in the
cluster and where others in the cluster live. In our example, we want
to make sure we have the following:
In /etc/hosts:
127.0.0.1 localhost localhost.localdomain
192.168.0.1 server1 server1.mydomain
192.168.1.1 server1p
192.168.0.2 server2 server2.mydomain
192.168.1.2 server2p
In /etc/mosix.map:
1 192.168.1.1 2
The change that might surprise you in /etc/hosts is 192.168.1.1 being
mapped to "server1p". The "p" on the end stands for "private". Same
for 192.168.1.2 pointing to "server2p". Each of the two servers will
talk amongst themselves as "server1p" and "server2p", while talking to
the rest of the office network as "server1" and "server2". If you
have a DNS server in your office, make sure these changes go in.
The format of /etc/mosix.map is: the first field is a unique MOSIX
number, the second field is the starting address, and the third field
is the number of nodes. In the above /etc/mosix.map, this means "I
belong to a MOSIX cluster which starts numbering at MOSIX id # 1,
starting with 192.168.1.1, and containing 2 nodes." The same
/etc/mosix.map file will be used on both nodes, and when MOSIX looks
up the address in /etc/hosts of the machine it is running on, it
automatically recognizes which MOSIX ID number it is. When MOSIX runs
on the second node with the same /etc/mosix.map, it will see that it
has address 192.168.1.2, and must therefore be MOSIX ID #2. Quite
elegant and effective. Also, if you're going to do something
complicated with multiple MOSIX nodes on different subnets, that's
possible just by adding another line to /etc/mosix.map. And, if you
choose to have a multi-homed MOSIX node where MOSIX is allowed to talk
out both interfaces, you can also use the ALIAS keyword in
/etc/mosix.map to denote an alternative address for a node. Details
on this are outside the scope of this document, but do keep in mind
it's possible to accomplish.
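The lookup rule described above can be sketched as a small shell
function. This only illustrates the rule as the text states it and is
not MOSIX's own code; the name mosix_node_id is made up for this
example, and it handles a single map line without the ALIAS keyword.

```shell
# mosix_node_id MAPLINE IP
# MAPLINE is one line of /etc/mosix.map: "first_id start_address node_count".
# Prints the MOSIX ID that IP would get, or fails if IP is outside the range.
mosix_node_id() {
    echo "$1" | awk -v ip="$2" '
        # Turn a dotted quad into a single number so addresses can be subtracted.
        function ip2int(s,    p) {
            split(s, p, ".")
            return ((p[1] * 256 + p[2]) * 256 + p[3]) * 256 + p[4]
        }
        {
            offset = ip2int(ip) - ip2int($2)
            if (offset >= 0 && offset < $3) {
                print $1 + offset
                found = 1
            }
        }
        END { exit (found ? 0 : 1) }'
}

mosix_node_id "1 192.168.1.1 2" 192.168.1.1    # prints 1
mosix_node_id "1 192.168.1.1 2" 192.168.1.2    # prints 2
```

An address outside the range, such as 192.168.1.3 with the map line
above, prints nothing and returns a non-zero status.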
At this point you should have an updated kernel, modified /etc/hosts,
and a new /etc/mosix.map. Again, the temptation is to boot it and see
how it works. But not just yet...
DRBD should be downloaded and installed next. It is kernel-specific,
and must be compiled for a given kernel. This will not participate in
the computational cluster, but because we already have a kernel built
(and it probably works) we might just want to get DRBD built and out
of the way now.
The current DRBD file is drbd-0.6.10.tar.gz. Untarring and building it
will go something like this:
# tar xvfz ~/drbd-0.6.10.tar.gz
... <tar output here>...
# cd drbd-0.6.10
# make
... <make output here>...
# make install
... <make output here>...
The last step of "make install" should place a copy of drbd.o into
your module directory for this kernel. DRBD is surprisingly small, so
the build won't take very long. Other utilities for using DRBD will
be installed under /usr/bin and/or /usr/sbin.
Now, double-check and make sure you have ALL the previously-listed
packages for this cluster downloaded and on your system disk. If so,
go ahead and do a reboot and bring up your new kernel!
If all went as planned, you now have a brand new kernel running, and
MOSIX is listening and waiting for the other node to come up. You can
see what MOSIX is thinking by doing a "mon" command. The output will
be a screen that shows the configured nodes across the bottom, and the
loads on each. In this case, you will see numbers 1 and 2 briefly
appear on the bottom of the screen, and then when it discovers that 2
is not yet up, 2 will disappear leaving only 1. If you open up
another window and do something that is computationally-intensive or
IO-intensive, and then watch the "mon" window, you should see a bar
rising above the "1" to indicate load.
If you've gotten this far, you're but minutes away from having a
working compute cluster. Only minutes? Of course! All you have to
do is duplicate the work you've already done on server1 using "dd"
onto the other server's disk, and then you'll be ready!
Of course, if you haven't gotten this far successfully, go back and
check to make sure you have booted the right kernel, make sure your
MOSIX kernel build worked properly without errors, and just overall
check and make sure everything succeeded the way it was supposed to.
If you're a kernel-compilation-newbie, make sure you have all the
drivers for all your hardware configured in properly. Check out the
websites, newsgroups, and mailing lists for the kernel, MOSIX, and
DRBD to see if you have done things right and if maybe there's a bug
in one of the packages. When you've booted MOSIX successfully and you
can run "mon" successfully and see the load, you're ready to move on.
Now, back to duplicating that system disk. It's quite simple. Bring
down server1 and pop in the disk you would have used as the system
disk for server2. If this is a SCSI machine, make sure you change the
SCSI ID to point to an unused ID. If this is an IDE machine, make
sure you have set the master/slave jumpers appropriately. If you are
using a RAID array as your system disk, you may or may not be able to
attach your group of disks to the controller as a secondary logical
group. That depends entirely upon the controller.
If you are using SCSI, IDE, or a compatible RAID controller that can
handle it, boot up server1 with the system disk(s) from server2
attached as a secondary. Login as root. In this example case, I am
using /dev/sda as the system disk, /dev/sdb as the shared disk, and so
I attach the system disk from server2 as /dev/sdc. (I could have also
yanked /dev/sdb and replaced it with server2's system disk, you could
do that if you are short on space, power connectors, or whatever.) To
duplicate server1's system disk (/dev/sda) over to server2's system
disk (/dev/sdc), I do this at the prompt:
# dd if=/dev/sda of=/dev/sdc
...and then wait. If you're not on a RAID 0, 1, or 0+1 system, keep
in mind the maximum throughput rate of about 65MB/second while you
wait.
Please note that "dd" is a powerful tool. If you mix up the
"if" and "of" designations ("if" stands for "input file" and "of"
stands for "output file"), you WILL destroy all
your hard work of installation thus far. Read the man/info page for
more details about "dd", but just make sure you don't mix things up.
It's so heartbreaking to issue a "dd" on something you worked
so hard on, only to watch it evaporate into oblivion when you
thought it was being duplicated. Trust me, I know.
Just for giggles, you might want to put the command "time" before
"dd", such as "time dd if=...". This will give you an idea of the
speed of the disk's raw throughput if you don't already know.
Likewise, the "hdparm" tool is a fun item that can give you
performance numbers as well. Try "hdparm -Tt /dev/sda" on an idle
machine and see what you get. The numbers may very well surprise you.
The man/info page for hdparm has more details.
You might be able to improve the performance of "dd" a bit by passing
it the "bs" parameter, which specifies block size. By specifying a
larger block size, you can optimize the read and write operations.
The "bs" parameter also takes a suffix like "k" or "M" to give
multiples of kilo or megabytes. If you're on a system with, say, 512M
of memory, you could append "bs=256M" on the end of the dd command and
get better performance than without it.
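Given how unforgiving "dd" is, it may be worth rehearsing the "if",
"of", and "bs" arguments on scratch files before pointing them at real
disks. The sketch below assumes GNU dd (for the "M" suffix) and
mktemp; the helper name copy_image is made up for this example, and
nothing here touches /dev/sda or /dev/sdc.

```shell
# copy_image INPUT OUTPUT BLOCKSIZE: copy INPUT to OUTPUT in BLOCKSIZE chunks.
copy_image() {
    dd if="$1" of="$2" bs="$3" 2>/dev/null
}

src=$(mktemp)
dst=$(mktemp)

# Build a 4 MB stand-in for a disk, filled with zeros.
dd if=/dev/zero of="$src" bs=1M count=4 2>/dev/null

copy_image "$src" "$dst" 1M

# cmp exits with status 0 only when both files are byte-for-byte identical.
if cmp -s "$src" "$dst"; then
    echo "copy verified"
fi

rm -f "$src" "$dst"
```

Once the arguments look right on scratch files, the same form with a
larger block size applies to the real devices.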
When the dd is done, bring the machine down, put the disk back into
server2, and boot up server2. Leave server1 down for the moment. If
all went well with the dd, server2 will come up thinking it is
server1. All you have to do is change its IP address and hostname
and you should be set. Each distribution can be different in how it
sets hostname and IP, so you'll need to figure that out for your
particular distribution, but I can tell you that Red Hat systems will
want to have stuff found in /etc/sysconfig modified. On Red Hat 7.3,
8.0, and 9, you will want to modify the following files:
/etc/sysconfig/network-scripts/ifcfg-eth0
/etc/sysconfig/network-scripts/ifcfg-eth1
/etc/sysconfig/network
Again, YMMV from one distribution to another. Once finished, a reboot
of server2 should bring it up with eth0 as 192.168.0.2, eth1 as
192.168.1.2, and hostname "server2". If this is true, boot up
server1 and we'll see our first "real" demo of MOSIX in action.
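As a rough illustration for Red Hat style systems, the private
interface file on server2 might look something like the fragment
below. The netmask is an assumption (the article never states one), and
your distribution release may expect additional lines.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth1 on server2 (private link)
DEVICE=eth1
BOOTPROTO=static
IPADDR=192.168.1.2
# Assumed /24 network; use whatever netmask your private link really has.
NETMASK=255.255.255.0
ONBOOT=yes
```

The ifcfg-eth0 file follows the same pattern with 192.168.0.2, and on
these releases the hostname normally lives in /etc/sysconfig/network as
a HOSTNAME= line.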
Over on server1, login and start up "mon". You should now have both
numbers "1" and "2" showing on the bottom of the screen, meaning that
MOSIX sees both machines as being members of the cluster. Leave "mon"
running here while you go over to server2 and login as root. Do
something like a recursive directory listing of the whole system, "cd
/ ; ls -aglR". Watch the "mon" screen on server1, and you'll see the
load come up on server2. This verifies that server1 knows about what
server2 is doing. While this is running, open another terminal on
server2 and issue a similar command, only this time force it to
execute on server1 with the "runon" command:
# cd / ; runon 1 ls -aglR
You'll see that even though you are typing in the commands on server2,
the "runon" command forced server1 to start executing the recursive
directory listing, using the contents of server2's hard drive. You
should see the activity light on the gigabit card stay on solid, or at
least blinking wildly fast. Now, directory listings are not all that
interesting, however if you had a computationally-intensive process
running, MOSIX would automatically recognize this and internally do a
"runon" and shuffle that process over to the other machine
transparently. I do not have a decent example to give you to
do this, but with a little imagination I'm sure you can come up with
something. My first few tests of MOSIX and my demonstration of it for
a company I worked for when I built my first MOSIX cluster in 1998 or
so involved using it to call a "cracklib"-like homebrewed program for
checking crackability of passwords on our servers. The crypt()
function is purely mathematical with no IO, so you might try using a
program to call crypt() continuously and watch the execution migrate
as a purely computational migration. It's really neat to watch.
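In the same spirit as the crypt() suggestion, here is a minimal
CPU-bound loop that does arithmetic and no disk or network IO. It is a
stand-in for a real workload, not a benchmark; the function name burn
is made up for this example, and whether and when MOSIX decides to
migrate it depends on the load on each node.

```shell
# burn N: run N rounds of integer arithmetic, then print the final value.
# The constants are those of a classic linear congruential generator;
# they are only here to give the CPU something to chew on.
burn() {
    x=1
    i=0
    while [ "$i" -lt "$1" ]; do
        x=$(( (x * 1103515245 + 12345) % 2147483648 ))
        i=$(( i + 1 ))
    done
    echo "$x"
}

# A short run. Raise the count by a few orders of magnitude to keep a
# CPU busy long enough to watch the load move in "mon".
burn 100000
```

Starting several long runs in the background on one node while
watching "mon" on the other should show the load bars rise.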
Note that MOSIX also gives you some performance-tuning tools.
Using these, you can tune MOSIX to be responsive to your specific
speed of network card. In this case, we have just a plain gigabit
connection, which although it's fast for ethernet, isn't really all
that fast compared with proprietary networking technologies like
Myrinet. Myrinet cards start off at sustained throughputs of about
250MB/sec, with much lower latency than anything in the ethernet
world, and go up from there. Likewise, the price of Myrinet reflects
this fact. If you're really interested, MOSIX allows you to fine-tune
your installation to the cards you have whether you have plain 10/100
or Myrinet, but because this installation is using $60 gigabit cards
and not $1100 Myrinet cards, you won't need to worry about it until
someone hands you a budget that can accommodate those kinds of
expenses.
If all you wanted was to brag to your friends or put something on your
resume that says "Built Linux Cluster", you could quit here and
legitimately call it a success. You did it. What you have is, truly,
a computational cluster. You could buy up a zillion other PC's just
like these, lay down a copy of the kernel you just built in each one,
and use them to predict the weather, calculate prime numbers, find the
next digit of Pi, or play chess. In fact, the only change you'd have
to make is to change the /etc/mosix.map to say it has some "X" nodes
starting at 192.168.1.1 instead of just 2, set the IP address on each
one, and away you go. You now have the very embryo of a major
supercomputer, all thanks to the fine gentlemen in Jerusalem and their
brilliance and hard work. It measures "cool" on the geekometer. But
not quite way cool until you tap into doing more with it than
just being a rocket-fast Quake engine. Next stop: high-availability
data!
If you have not already done so in the setup of the machines, fdisk
and format the device you want to share between them. In this case,
that is /dev/sdb. You only need to do this once, on server1; the
partition /dev/sdb1 will be mastered onto server2 by DRBD. Neat, eh?
All you have to do is prepare one disk, and the other will be handled
automatically. This is, of course, assuming identical disks. The
most critical part of the "identicalness" of the disks is the number
of blocks. You can mix disparate disks together, but for the purpose
of this article, everything is fully identical between server1 and
server2.
To partition /dev/sdb, run "fdisk /dev/sdb". Below is the dialog
between me and fdisk; my input is whatever follows each prompt. The
disk being worked on is a 9G SCSI.
# fdisk /dev/sdb
The number of cylinders for this disk is set to 1106.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Command (m for help): p
Disk /dev/sdb: 9100 MB, 9100044288 bytes
255 heads, 63 sectors/track, 1106 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-1106, default 1): 1
Last cylinder or +size or +sizeM or +sizeK (1-1106, default 1106): 1106
Command (m for help): p
Disk /dev/sdb: 9100 MB, 9100044288 bytes
255 heads, 63 sectors/track, 1106 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 1 1106 8883913+ 83 Linux
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
Once you have done this, you're ready to format the disk. For most
general purpose usage, the ext3 filesystem is ideal, however you may
choose any filesystem that meets your needs. ReiserFS is ideally
suited to large mail and news servers where there are many thousands
of files in a single directory. XFS and JFS are also good choices as
journalling filesystems. Unlike other operating systems which may
only support just one type of journalling filesystem, Linux supports
several journalling filesystems to choose from. In this case, we will
format /dev/sdb1 with ext3. Use the following command:
# mke2fs -j /dev/sdb1
The -j parameter turns on journalling. The ext3 filesystem is
backward compatible with ext2 and the same executable formats either,
hence the need for a flag to distinguish plain, unjournalled ext2 from
journalled ext3.
When this completes, you will have a partition that could be mounted
and files placed on it. I would recommend that you do so, and place
some files on there so that you can better see the replication occur.
Once you have copied some files onto there, unmount the partition. It
will be remounted under DRBD later.
Now, the next step will be to verify that DRBD is installed and
working correctly. To do this, execute:
# modprobe drbd
This should complete without any output or errors. Then, to see what
modules are installed, execute:
# lsmod
In the output, you should see "drbd" in the list of installed modules.
You can see what DRBD is thinking by inspecting the contents of
/proc/drbd. If you're not familiar with /proc, it's a
pseudo-filesystem (not really files on the local disk) that is a sort
of viewport into the running kernel. You can treat entries under
/proc as if they were real files and directories, and use "cat
/proc/drbd" to print out what the DRBD driver is currently thinking.
If you have modified the file drbd_main.c in the DRBD package
and modified minor_count with a larger number of devices, you
will get more output, but otherwise you will get just these two devices:
# cat /proc/drbd
version: 0.6.10 (api:64/proto:62)
0: cs:Unconfigured st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
1: cs:Unconfigured st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
Now, we have to tell DRBD that you wish to use the device /dev/sdb1 on
server1 as the source drive, and /dev/sdb1 on server2 as the
destination drive. The way we do this is with the "drbdsetup"
utility that comes with drbd. It will install as
/usr/sbin/drbdsetup when you install it with "make install".
You may see all the options "drbdsetup" allows by simply executing
drbdsetup without any arguments:
# drbdsetup
USAGE:
drbdsetup device command [ command_args ] [ comand_options ]
Commands:
primary [-h|--human]
secondary
secondary_remote
wait_sync [-t|--time val]
wait_connect [-t|--time val]
replicate
syncer --min val --max val
down
net local_addr[:port] remote_addr[:port] protocol
[-t|--timeout val]
[-r|--sync-rate|--sync-max val] [--sync-min val]
[--sync-nice val] [-k|--skip-sync]
[-s|--tl-size val] [-c|--connect-int] [-i|--ping-int]
[-S|--sndbuf-size val]
disk lower_device [-d|--disk-size val] [-p|--do-panic]
disconnect
show
Version: 0.6.10 (api:64)
First, tell DRBD to map the physical device /dev/sdb1 to /dev/nb0
using "drbdsetup":
# drbdsetup /dev/nb0 disk /dev/sdb1
The next thing to do is tell DRBD that you wish to make /dev/nb0
replicate to another node on the network. The default port number for
drbd to talk on is 7788. If you're not familiar with port numbers and
what they mean, you might like to read over the Linux Networking
Overview document or one of the excellent books on networking
available from O'Reilly.
At this point, if you would like to access the files you copied on to
/dev/sdb1, you need to access them through the block device /dev/nb0.
Because of the above drbdsetup command, DRBD will be managing the
writes to /dev/sdb1 through /dev/nb0. You can mount it again as
/dev/nb0 and verify that you can still see those files you placed on
there back when you initially formatted it. In this case, mount it to
/mnt/testarea with the command "mount /dev/nb0
/mnt/testarea". If /mnt/testarea does not exist before you
mount it, execute "mkdir /mnt/testarea" first and then
mount it. Make sure you do this ONLY on server1. Server2 must NOT
attempt to access /dev/nb0 except with DRBD itself. The commands to
set up the network side of the mirror and make server1 the primary
are:
# drbdsetup /dev/nb0 net 192.168.1.1:7788 192.168.1.2:7788 C
# drbdsetup /dev/nb0 primary
The parameter "192.168.1.1:7788" above sets the local address on which
to bind drbd. The next parameter, "192.168.1.2:7788" instructs drbd
to expect the other half of the mirror at that address on the other
machine. The final parameter, "C", means to use "protocol C" of drbd
to transfer the data. You may try other protocols (A and B are also
available) and read the documentation on DRBD to find out which
protocol is right for your needs. In my experience with DRBD thus
far, protocol "C" has afforded me the highest throughput between the
two machines. The second command line forces this to be the "primary"
node, the source from which server2's /dev/sdb1 will be mastered.
Once you have executed the above drbdsetup commands, you should have
output that looks like the following:
version: 0.6.10 (api:64/proto:62)
0: cs:WFConnection st:Primary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
1: cs:Unconfigured st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
You can see that the first device has changed and shows "WFConnection".
This means "Wait For Connection". It's waiting for server2 to start
talking back to it. So, we now move to server2, boot it up and issue
similar drbdsetup commands:
# modprobe drbd
# drbdsetup /dev/nb0 disk /dev/sdb1
# drbdsetup /dev/nb0 net 192.168.1.2:7788 192.168.1.1:7788 C
Now, find out what DRBD is thinking:
# cat /proc/drbd
version: 0.6.10 (api:64/proto:62)
0: cs:SyncingAll st:Secondary/Primary ns:0 nr:960 dw:960 dr:0 pe:0 ua:0
[>...................] sync'ed: 0.1% (8674/8675)M
finish: 7:25:08h speed: 318 (318) K/sec
1: cs:Unconfigured st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
Congratulations, you now have data flowing from server1's /dev/sdb1
over to server2's /dev/sdb1! It is currently replicating data over
the eth1 network cards.
However, take a look at the "finish" and "speed" fields. The speed is
only about 300KB/sec, and it's going to take over 7 hours to complete.
This is much too slow to be useful. DRBD's default rate is 250KB/sec,
however this can be bumped up to the physical limits of the hardware
involved. DRBD's hard limit at the time of this writing is 600MB/sec,
however to my knowledge there are no commercially available storage
arrays that will give you this kind of sustained performance, so the
limitation of 600MB/sec is not really a limitation.
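A back-of-envelope check of the "over 7 hours" figure: divide the
device size by the rate. The sketch below uses the 8675M device size
and the 318 K/sec speed from the output above, plus the roughly
65MB/sec single-disk ceiling mentioned earlier (65 x 1024 = 66560
K/sec). The function name eta is made up for this example, and DRBD's
own "finish" estimate is computed by the driver and need not match this
simple division exactly.

```shell
# eta MEGABYTES KB_PER_SEC: print the h:mm:ss needed to move that much data.
eta() {
    secs=$(( $1 * 1024 / $2 ))
    printf '%d:%02d:%02d\n' $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
}

eta 8675 318      # at the observed speed: prints 7:45:34
eta 8675 66560    # at about 65MB/sec:     prints 0:02:13
```

The gap between those two results is the reason for raising the sync
rate.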
The drbdsetup program can be used to set the sync rate. It's
important to note that the "primary" node (server1) determines the
sync rate. The following command will do nothing if run on the
"secondary" (server2). The sync rate value can be set to the maximum
with the following command line (note that these numbers are in units
of KB):
# drbdsetup /dev/nb0 syncer --min 599999 --max 600000
Now, take a look at the contents of /proc/drbd:
# cat /proc/drbd
version: 0.6.10 (api:64/proto:62)
0: cs:SyncingAll st:Primary/Secondary ns:1451064 nr:0 dw:0 dr:1451100 pe:1129 ua:0
[===>................] sync'ed: 16.4% (7258/8675)M
finish: 0:04:14h speed: 38,121 (6,877) K/sec
1: cs:Unconfigured st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
You can see two important changes. First, the sync rate is now much
higher. In fact, most likely, it approaches the maximum sustained
throughput of your chosen hardware. Secondly, the time to resync has
dropped dramatically. In the "speed" field are two numbers. The
first number represents the instantaneous speed of the data being
replicated at that instant. The second, parenthesized number is the
average over the duration of the synchronization. Here, the
instantaneous speed is about 38MB/sec, while the average is under
7MB/sec because the earlier period of 250KB/sec is averaged in.
On server1, during the sync, you should test the replication a little
bit by creating a new directory or two under /mnt/testarea, and
copying some files into that area. Do this during the sync because it
will demonstrate how DRBD duplicates the data continuously and
transparently as it changes. As you are writing files and directories
to /mnt/testarea on server1, those write commands are being caught by
drbd and sent over to server2 where those write commands are
duplicated.
I would now recommend waiting for the two nodes to complete their
synchronization. The time involved will depend on the size of the
disk to be replicated and the speed of the hardware you have in each
node. Ponder for a moment the numbers you see as the sync rate, as
well as the numbers reported by hdparm, if you ran that before. If
you have the hardware and time to do so, I would encourage you to
rebuild this exact same cluster on machines of various hardware
platforms. Compare a pair of $700 desktop PC's with this
configuration, a pair of $5000+ "server" class machines, and some
older hardware. My guess is that you'll find a relatively small
difference in performance between the desktop and "server" class
hardware, but marked difference between the older and the newer
hardware. Keep in mind that no matter how expensive and sophisticated
your system, hard drives can only push a maximum of about 65MB/sec,
regardless of interface controller. If you are someone responsible
for hardware recommendation or purchase in your organization, this may
be the opportune time to scrutinize your requisitions, and you may be
able to realize future savings by not overspending unnecessarily on
hardware. Examine whether or not your server truly demands SCSI,
which realizes extra speed only in short write-bursts that do not
completely fill the drive's cache, or if you may be able to get along
far more cheaply with a SATA disk, or perhaps a cheap ATA/SATA RAID
controller. Because both plain SATA controllers and SATA RAID
controllers are only on the order of $50 each, and good SCSI or SCSI
RAID controllers are typically an order of magnitude more expensive, I
would strongly recommend testing them side-by-side.
You will know that the sync has completed when you see this kind of
output, showing that the status has changed from "SyncingAll" to
"Connected":
# cat /proc/drbd
version: 0.6.10 (api:64/proto:62)
0: cs:Connected st:Primary/Secondary ns:8883924 nr:0 dw:12 dr:8883933 pe:0 ua:0
1: cs:Unconfigured st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
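If you would rather not keep typing "cat /proc/drbd", the connection
state can be pulled out of that output with a little awk. This is a
sketch that assumes the 0.6.10 line format shown above; the function
names drbd_cstate and wait_for_sync are made up for this example. The
drbdsetup usage listing also shows a wait_sync command, which may be
the better tool, but its exact behavior is not covered here.

```shell
# drbd_cstate DEVICE [FILE]: print the "cs:" field for a DRBD minor number.
# FILE defaults to /proc/drbd; pass a saved copy to experiment offline.
drbd_cstate() {
    awk -v dev="$1" '$1 == (dev ":") { sub(/^cs:/, "", $2); print $2 }' \
        "${2:-/proc/drbd}"
}

# wait_for_sync DEVICE: poll every 10 seconds until the state is "Connected".
wait_for_sync() {
    until [ "$(drbd_cstate "$1")" = "Connected" ]; do
        sleep 10
    done
}
```

For example, "wait_for_sync 0 && echo done" on server1 would return
once device 0 has finished its initial sync.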
Anyway, by the time the replication completes, you now have a pair of
systems with completely identical /dev/sdb1 partitions. Identical in
every way, down to the block level. At this point, you can bring down
drbd on each machine, and mount /dev/sdb1 and verify that the files
you copied onto /dev/sdb1 on server1 are now found on both server1 and
server2. To bring DRBD completely down and mount it back up as a
"normal" block device:
# umount /dev/nb0 (do this only on server1)
# drbdsetup /dev/nb0 disconnect
# drbdsetup /dev/nb0 down
# rmmod drbd
# mount /dev/sdb1 /mnt/testarea
# ls -l /mnt/testarea
On either machine, your output should be identical, and both machines
should show the exact same data. If you are not yet at this point,
you might have incorrectly typed in the drbdsetup commands that set up
the replication, or perhaps there is a problem with your network
cabling, network cards, or something else. Make sure you can
establish connections between the machines over the gigabit cards
using either a crossover cable or gigabit switch. If you are
experiencing a situation where DRBD is not talking over your gigabit
cards, turn on some services such as ftp or telnet, and make sure you
can connect between them. Also, make sure your ipchains/iptables
configuration is such that port 7788 is open.
If you are now here, you have successfully replicated data from
server1 to server2. The files that are present on server1 are also
present on server2, in their entirety. If your goal was only to
achieve nearly-real-time replication of data, you could stop here.
You could bring up the servers in this fashion and leave them running
indefinitely. All writes to server1 would transparently and rapidly
be replicated on server2. We could call this a kind of "manual
cluster" (isn't that an oxymoron?), in that if server1 were to fail,
we could manually power off server1 and put server2 into primary mode,
and then manually change its IP addresses. The next section will
describe how to set this up for automated fail-over without manual
interaction.
Because we're using gigabit ethernet that is substantially faster than
most disk controllers, and provided the bus speeds are fast enough to
do so on both machines, we can assume that even full-throughput writes
to the disk of server1 are being replicated at the same speed over on
server2. If we were using 100Mbit instead of gigabit, we'd be limited
to a maximum of perhaps 12MB/sec, as that is the theoretical maximum of
100Mbit ethernet. Instead, you should probably be realizing
throughput on the order of 20-55 MB/sec, depending on hardware. At
the time of this writing, I have experimented mostly with older
Pentium III systems, all of which have achieved 25-40 MB/sec on older
SCSI controllers and older disks. I would be extremely interested to
hear of your experience if you have successfully set this up on more
modern hardware such as Pentium 4/Xeon or Opteron/Athlon64 systems
with faster bus speeds, faster controllers, and faster disks.
The final hurdle is establishing complete fail-over between the
servers. This means that when server1 dies or is taken down, server2
detects this change and follows a pre-defined course of action:
- Force server1 to reboot or physically turn off, if it's not fully
"out of the way" yet.
- Force DRBD to go into "primary" mode.
- If needed, run fsck on the shared DRBD filesystems.
- Mount the shared filesystems.
- Take over the IP address of server1.
- Start the services which access the data in the DRBD partitions.
To this end, I have written my own cluster manager, TKCluster, which (mostly) does just this. It has
some bugs, but I'm glad to share it with everyone to see if anyone out
there is interested in seeing it improve.
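The take-over sequence listed above can be written down as a dry run
that only prints each step in order. This is a sketch for orientation,
not TKCluster's actual code: the function name failover_plan is made up
here, the device, mount point, and cluster address are the ones used as
examples in this article, and nothing is executed.

```shell
#!/bin/sh
# Dry-run sketch of the take-over sequence: prints the steps, runs nothing.
# DEVICE, MOUNTPOINT and CLUSTER_IP follow this article's examples.
DEVICE=/dev/nb0
MOUNTPOINT=/opt
CLUSTER_IP=192.168.0.250

failover_plan() {
    echo "1. force server1 out of the way (reboot or power off)"
    echo "2. put DRBD device $DEVICE into primary mode"
    echo "3. run fsck on $DEVICE if needed"
    echo "4. mount $DEVICE on $MOUNTPOINT"
    echo "5. take over $CLUSTER_IP with the IPaddr script"
    echo "6. start the services that use $MOUNTPOINT"
}

failover_plan
```

The order matters: the old primary has to be out of the way before the
device is promoted, and the address should move only once the data is
mounted and ready to be served.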
To install TKCluster, you will first need to install Heartbeat. Heartbeat
in turn requires libnet from The Packet Factory. Depending on your
distribution, you might also have to install some other packages, but
libnet is probably the most important piece to have in order to compile
Heartbeat.
The piece of Heartbeat I find most useful in TKCluster is the
IPaddr script. If you do the usual "./configure && make && make
install" of Heartbeat, this script is probably located in
/usr/local/etc/ha.d/resource.d/. If not, it might be in
/etc/ha.d/resource.d/. This script is used to take over the IP address
of the failed machine by sending ARPs and causing the other machines
in the LAN to refresh their ARP caches with the new hardware address
for that IP.
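Since the script may live in either of the two directories named above,
a small helper can report which one applies on a given machine. This is
a minimal sketch; the function name find_ipaddr is made up for this
illustration.

```shell
#!/bin/sh
# Print the path of the first executable IPaddr script found in the
# directories given as arguments; return non-zero if none has one.
find_ipaddr() {
    for dir in "$@"; do
        if [ -x "$dir/IPaddr" ]; then
            echo "$dir/IPaddr"
            return 0
        fi
    done
    return 1
}

# Try the two locations mentioned in the text, in that order.
find_ipaddr /usr/local/etc/ha.d/resource.d /etc/ha.d/resource.d \
    || echo "IPaddr not found in either location" >&2
```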
Until this point, we have not discussed the method by which the data
would be highly available across two servers with different IP
addresses. Linux supports the concept of IP aliasing: putting more
than one IP address onto a single network card. To make our services
appear highly available to the rest of the LAN, we need to allocate an
address for other machines to access the cluster, which we will call
our "cluster address". Right now, server1 and server2 live on
192.168.0.1 and 192.168.0.2 respectively within the LAN. Let's
allocate 192.168.0.250 and configure eth0 on server1 to hold that
address:
# ifconfig eth0:0 192.168.0.250
It's that simple! The interface "eth0:0" is an alias of the "real"
interface, eth0; aliasing allows multiple IPs on the same card. If any
machine on the LAN attempts to connect to 192.168.0.250, it will be
talking to server1 directly. Aliasing is necessary so that each cluster
node can first come up with its own un-aliased address and remain
reachable on the network at that address. When a decision is made by
either a user or the cluster manager for a node to seize control of the
"primary"
role, that machine will assume the cluster address (192.168.0.250) and
will send ARP packets by calling Heartbeat's IPaddr script. The
aliased address can be brought up and down without interrupting the
ability to connect to the server's non-aliased address. When services
such as Samba or NFS are brought up on the cluster, the clients on the
LAN will be told to point to the cluster address, 192.168.0.250. If
the primary cluster node goes down, the secondary will assume control
of that address. In an ideal fail-over condition, users on the LAN
who are accessing the cluster will notice a pause while accessing the
dead or dying server, and then when the secondary takes over and grabs
the address, the service will resume. They should notice
nothing more than a delay or pause as the secondary assumes the
primary role.
Once Heartbeat is installed, you will probably want to test it and
observe how it functions within the cluster environment. Bring down eth0:0 on server1:
# ifconfig eth0:0 down
Now, bring up the same address on server2 using the IPaddr script:
# /usr/local/etc/ha.d/resource.d/IPaddr 192.168.0.250 start
You will see output which shows it is sending a lot of ARPs on the
LAN. All the machines which had previously cached 192.168.0.250 and
mapped it to the hardware address of server1 will now have the cache
refreshed to show the hardware address of server2's eth0 card. IPaddr
does all the work of determining which interface it should be using
for the address so that the end user doesn't need to care about that.
You can bring down this interface safely in the same fashion it was
brought up:
# /usr/local/etc/ha.d/resource.d/IPaddr 192.168.0.250 stop
Note that if a machine goes down unexpectedly while it is holding an
aliased address that was brought up with IPaddr, a lockfile named after
the alias is left behind in /usr/local/var/lib/heartbeat/rsctmp/IPaddr/.
I would recommend adding the following line to your initialization
scripts, such as /etc/rc.sysinit or /etc/rc.local.
These may or may not be the right scripts for you, depending on your
distribution. What you want is to modify or create an init script
which is executed after /usr/local/ is mounted (and before the cluster
manager is started, which will be covered shortly) that contains the
following:
rm -f /usr/local/var/lib/heartbeat/rsctmp/IPaddr/*
This will clear out all lockfiles from previous boots, so that the
cluster will always get eth0:0.
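A slightly more defensive form of the same cleanup is sketched below,
for systems where the directory may not exist (for example, if
Heartbeat was installed under a different prefix). The function name
clear_ipaddr_locks is made up for this illustration.

```shell
#!/bin/sh
# Remove stale IPaddr lockfiles left behind by an unclean shutdown.
# Takes the lock directory as an optional argument and quietly does
# nothing if that directory does not exist.
clear_ipaddr_locks() {
    lockdir=${1:-/usr/local/var/lib/heartbeat/rsctmp/IPaddr}
    [ -d "$lockdir" ] || return 0
    rm -f "$lockdir"/*
}

clear_ipaddr_locks
```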
To install TKCluster, untar the package and copy the files to the
proper locations:
# cp storagecluster /etc/init.d/
# cp *.pm /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi/ (this location depends on your version of Perl and its @INC path)
# cp *.pl /usr/local/bin/
# cp cluster.conf /usr/local/etc/
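If you are unsure where the .pm files belong on your system, Perl can
tell you. The sketch below uses Perl's standard Config module to print
the site directories of whichever perl is first in your PATH; the
function name perl_site_dirs is made up for this illustration.

```shell
#!/bin/sh
# Print the directories where the installed Perl looks for locally
# added modules: first the architecture-specific site directory
# (sitearch), then the architecture-independent one (sitelib).
perl_site_dirs() {
    perl -MConfig -e 'print "$Config{sitearch}\n$Config{sitelib}\n"'
}
```

Running perl_site_dirs prints two paths. Either should work as the
destination for the "cp *.pm" step, since both are normally part of
@INC.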
It's now time to configure TKCluster so that it knows about your particular setup.
If you know Perl, you will immediately recognize the internals of
cluster.conf as Perl. This config file is meant to be called by
eval() from Perl. Yes, there are potential security concerns
with having a config file that is directly eval'd by Perl. A
malicious local user with write access to the file could put something
evil in there. This is a good reason to set permissions so only root
can write to the file, and not to give out your root password to anyone.
Perhaps a future release will change this behavior so that there are
no security concerns.
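One way to catch loose permissions before starting the cluster is a
quick check like the one below. It is a minimal sketch, not part of
TKCluster; the function name conf_is_locked_down is made up for this
illustration, and it checks only the write bits, not who owns the file.

```shell
#!/bin/sh
# Succeed only if the file exists and is not writable by group or others.
# Ownership is not checked here; verify separately that root owns it.
conf_is_locked_down() {
    [ -f "$1" ] || return 1
    loose=$(find "$1" -prune \( -perm -020 -o -perm -002 \) -print)
    [ -z "$loose" ]
}
```

For example, "conf_is_locked_down /usr/local/etc/cluster.conf || echo
WARNING" prints a warning if anyone besides the owner can write to the
file. Running "chmod 600" on the file as root fixes it.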
The config file itself, cluster.conf, contains comments to
explain each piece. Please remember that TKCluster is still young and
in development, and so some of the configuration parameters may be
relics of previous, unreleased versions. However, it has a decent
start and can be customized easily. The cluster.conf found in the
package is already configured with reasonably sane defaults for this
article. Perhaps only paths might need to be modified to work on your
particular distribution, but it's also possible that the paths are
already right for you. Double-check and make sure all the paths are
set up right. With this config file, the cluster will attempt to
mount /dev/nb0 to /opt and start NFS sharing it. You may want to edit
/etc/exports to contain the following line, or a similar line based
on your own security policies:
/opt *(rw,no_root_squash,async)
Note that this line, as it stands, leaves this NFS mount point
"flapping in the breeze" on your LAN, so use this only for testing and
then restrict it further as needed.
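As one example of restricting it, the wildcard can be replaced with the
network your clients live on. The sketch below just prints such a line;
the function name exports_line is made up for this illustration, and
192.168.0.0/24 is an assumption based on the example addresses in this
article, so substitute your own network and netmask.

```shell
#!/bin/sh
# Print an /etc/exports entry that limits a share to a single network.
# Usage: exports_line DIRECTORY NETWORK
exports_line() {
    printf '%s %s(rw,no_root_squash,async)\n' "$1" "$2"
}

# The /24 netmask is assumed; adjust it to match your LAN.
exports_line /opt 192.168.0.0/24
```

You may also want to reconsider no_root_squash once testing is done,
since it lets root on any permitted client act as root on the share.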
If you wish to experiment with a cluster containing multiple DRBD
devices, all you will need to do is add extra lines to the hashes in
the config file which are used on a per-device basis.
The TKCluster package has several files in it. The purpose of each of
these pieces is outlined here:
- storagecluster is the init script which starts the
clustering. It accepts "start" and "stop" parameters. During normal
operation, the cluster administrator will need to only run this with
the start command and the rest of the cluster is fired up
automatically.
- storaged.pl is the process which feeds info about the local
system to the other node when asked. It is also used to set certain
parameters and perform certain tasks. For example, a secondary that
is coming up will attempt to contact the storaged.pl on the remote
side with a "reboot" command. If the primary is in a confused state,
but its copy of storaged.pl is still running, storaged.pl will process
the request for reboot from the remote and reboot itself.
- getstoragestatus.pl is used to talk to a local or
remote copy of storaged.pl. All interaction with storaged.pl goes
through getstoragestatus.pl.
- becomestorage.pl is used to force the machine into a
particular state. It accepts parameters of "start", "stop" and
"grab". The "start" parameter is used when the machine will need to
dialog with the other server and determine if it should become primary
or secondary. The "stop" parameter is used to remove itself from the
cluster. The "grab" parameter is used to force the machine into
primary role and attempt to reboot the other machine.
- hispeed.pl is run continuously in the background and issues
drbdsetup commands on the primary node periodically to force maximum
sync rate. It also checks up on the status of the monitorstorage.pl
script and restarts it if necessary. In addition, it checks to see if
any of the DRBD devices have fallen into "StandAlone" status, which
means that the node lost communication with the other side and assumes
it will not be receiving any more contact. "StandAlone" is a
potentially-fatal situation, because future attempts by the remote
side to reconnect will go unheeded. Typically it's the primary node
that goes to "StandAlone". "StandAlone" means the mapping that was
set up by the drbdsetup "net" command to map a DRBD device to a net
address has been reset, just for that particular device. When
hispeed.pl sees that a device has gone to "StandAlone", it re-issues
the drbdsetup commands which re-map the device to an IP address and
port number.
- monitorstorage.pl monitors the other node of the cluster.
It periodically calls a MOSIX utility, mosctl, and
getstoragestatus.pl to find out the status of the other machine. If
monitorstorage.pl detects that neither MOSIX nor getstoragestatus.pl
can reach the other side, it takes one of two courses of action. If
the failed machine was the primary node, it initiates a
"becomestorage.pl grab" command and assumes the primary node's
identity. If the failed machine was the secondary node, it falls
through the script and exits without error. The hispeed.pl process
restarts monitorstorage.pl at this point. If the failure is still
occurring on the secondary node, monitorstorage.pl again falls through
and exits. This occurs until the secondary comes back up, when
monitorstorage.pl goes into a wait-loop. Even though
monitorstorage.pl doesn't seem to really "do" anything when it's
running on a primary node, in reality what it's doing is waiting for
the role of either machine to change. In this design of cluster,
neither machine is necessarily the "primary" or the "secondary" all
the time. They can switch roles as failures occur. So, if a machine
boots and right now it's primary (and monitorstorage.pl is doing little
more than waiting), it may fail, get fixed, and reboot and become
secondary, and have work to do in monitoring the new primary node.
- TKCluster.pm is a Perl module that contains
cluster-specific functions which are called by the above Perl scripts.
- TKgenericfuncs.pm is a Perl module that contains generic
functions I use in this and other Perl scripts.
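The decision that monitorstorage.pl makes, as described in the list
above, boils down to a small table. The sketch below restates it in
shell purely as an illustration; the real logic lives in the Perl
scripts, and the function name monitor_action and its arguments are
made up here.

```shell
#!/bin/sh
# Decide what the monitoring node should do next.
#   $1: role of the other node ("primary" or "secondary")
#   $2: "yes" if MOSIX or getstoragestatus.pl can still reach it
# Prints "wait", "grab" (become primary), or "exit".
monitor_action() {
    peer_role=$1
    peer_reachable=$2
    if [ "$peer_reachable" = "yes" ]; then
        echo wait
    elif [ "$peer_role" = "primary" ]; then
        echo grab
    else
        echo exit
    fi
}
```

Here "grab" stands for running "becomestorage.pl grab", and "exit"
stands for falling through so that hispeed.pl restarts the monitor.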
Now that you have copied the TKCluster files to where they are
supposed to go and checked over cluster.conf, you want to test the
cluster by firing it up with the init.d script. This should be done
first on server1. In one terminal window, touch the logfile and watch
the output:
# touch /var/log/cluster.log
# tail -f /var/log/cluster.log
The "tail -f ..." command will show output to the terminal window as
it is written to the file. Leave this running while you switch over to
another terminal window to start the cluster itself. You can switch
back and forth between the two terminal windows to watch what the
cluster is doing as it starts:
# /etc/init.d/storagecluster start
You should get output in the "tail ..." window showing that
storaged.pl has started. Shortly after that, monitorstorage.pl and
hispeed.pl should have started as well.
Now you may notice that monitorstorage.pl on server1 cycles once by
stopping and restarting NFS. This should only happen once, as the
first invocation of monitorstorage.pl dies off at a time when it is
not yet sure of the internal state of this cluster node being primary
or secondary. Another invocation of monitorstorage.pl will start up
momentarily, and this one should recognize that the server is primary
and go into its wait-loop. The original monitorstorage.pl goes zombie
at this point; the "Z" status will show up if you do a "ps auxwww |
grep monitorstorage.pl". This is a known bug, and I should probably
investigate it and make the script exit more cleanly.
If all went well, you should be able to do a "cat /proc/drbd" and see
that DRBD is in the WFConnection state. Now go over to server2 and do the same
thing, running storagecluster with the start parameter.
After this, the result should be that now a "cat /proc/drbd" shows you
that data is being synchronized between the servers, and that server1
is the master. Initially, sync rate will be relatively low at about
250KB/sec, but that will change within 5 minutes. If this has not
been successful, go back through the steps of setting up, making sure
that you tweak the configuration file for any difference you might
have between this example and your own particular setup.
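The /proc/drbd checks from the last few paragraphs can also be
scripted. The sketch below only looks for a state name anywhere in the
file, so it makes no assumption about the exact layout of /proc/drbd;
the function name drbd_has_state is made up for this illustration.

```shell
#!/bin/sh
# Succeed if the DRBD status file mentions the given connection state.
# The status file defaults to /proc/drbd; a second argument overrides it.
drbd_has_state() {
    state=$1
    status_file=${2:-/proc/drbd}
    grep -qF "$state" "$status_file" 2>/dev/null
}

if drbd_has_state WFConnection; then
    echo "waiting for the other node to connect"
elif drbd_has_state StandAlone; then
    echo "StandAlone: this node has stopped listening for its peer"
fi
```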
If you have successfully brought the cluster up to this point, you
might just want to try NFS mounting it from one of your client
machines in the network. If you do not have an NFS client, you may
substitute Samba. Configuration of Samba itself is beyond the
scope of this document; however, it should be easy for you to configure
it to work in most relatively simple Windows networks. You may also
wish to run both Samba and NFS at the same time, sharing the same DRBD
device. This is completely acceptable, and all you need to do is
write a script to start or stop both simultaneously and put that
script into the PRIMARYSERVICES section of cluster.conf. Because of
the nature of cluster.conf, it is read when the Perl scripts start, so
if you make changes to it while the cluster is running, you will need
to stop and restart the cluster.
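A wrapper of the kind described above, starting or stopping both
services with one command, might look like the following. It is a
minimal sketch: the function name run_services is made up, and the two
init script paths are guesses that differ between distributions, so
check where yours actually live.

```shell
#!/bin/sh
# Start or stop several services with a single command.
# INIT_SCRIPTS is a space-separated list of init scripts; the default
# paths below are assumptions and vary by distribution.
INIT_SCRIPTS=${INIT_SCRIPTS:-"/etc/init.d/nfs /etc/init.d/smb"}

run_services() {
    action=$1
    status=0
    for script in $INIT_SCRIPTS; do
        # Keep going if one service fails, but remember the failure.
        "$script" "$action" || status=1
    done
    return $status
}

# Saved as a standalone script, the last line would be:
#   run_services "$1"
```

Whatever path you save the script under is the one to list in the
PRIMARYSERVICES section of cluster.conf.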
One thing to watch for with NFS is that most NFS init scripts pass the
"-au" option to exportfs in the "stop" section of the script. This is
not what we want. By specifying "-au", the NFS server notifies
connected clients that the NFS share is going down. In a normal,
non-clustered system, this is the right idea; however, for a cluster,
the service should never be told to inform clients it is going down.
As far as the client machines are concerned, NFS is merely slow
during the time when one machine dies and the other takes over.
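To find out whether your own NFS init script does this, you can search
it for the offending call. The sketch below is one way to do that; the
function name find_unexport_all is made up for this illustration, and
the location of the init script depends on your distribution.

```shell
#!/bin/sh
# List the lines of an init script that run "exportfs" with "-au".
# Prints matching lines with their line numbers and returns non-zero
# when there are none.
find_unexport_all() {
    grep -n 'exportfs.*-au' "$1"
}
```

For example, "find_unexport_all /etc/init.d/nfs" (assuming that is
where your script lives) shows which lines to comment out or change in
the copy of the script used by the cluster.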
Once the data replication has completed, the next thing to do is test
the cluster and make sure it works. To do this, bring down server1.
You can do this in any way you like: allocate all the memory, hit the
reset button, do a "reboot -f", anything that will immediately bring
an end to "normal" operation. If you are watching the cluster.log
file over on server2, you should see that it detects "minor" failures
(i.e. server1 is not responding momentarily) and then ultimately
recognizes this as a "MAJOR" failure and begins its own sequence of
taking control of the DRBD device.
Using the default cluster.conf settings, there will be a period of 30
seconds or so where server2 attempts to contact server1, giving
"minor" error messages. If server1 were to recover during this time,
server2 would go back to normal monitoring mode. However, if server1
is truly down, it only takes another 30 seconds or so (possibly less
with faster hardware) to bring server2 into the primary role. These
timings can be tuned by changing the delay values in cluster.conf.
However, they shouldn't be set too low, because
normal network latency due to high load during replication may cause
the cluster to fail-over without there being a real problem. Based on
the speed and throughput of your particular hardware, you can tune the
cluster.conf timings to meet your needs.
By this time, server2 has taken over and is serving NFS and/or Samba
shares to the clients on the network. If any of your clients were
connected during this time and operating on the files in the share,
they saw a pause of about 1 to perhaps 1.5 minutes while server2 took
over. Otherwise, however, their operation was seamless, or at least
should have been. Because both NFS and Samba are disconnect-tolerant
to some degree, the client machines never really saw a problem. As
far as they know, whatever data they were reading or writing during
that time was just slow.
You now have a fully functioning cluster. You can expand on this by
adding more services that share the same DRBD device, or you may add
new DRBD devices to the cluster for separate services.
For use in a web environment, you can now put a webserver, or more
than one webserver, out in front of the cluster, and have them mount
the storage cluster via NFS and put your database engine onto the
cluster as well. This can be combined with web clustering via the NFS
or router "tricks" and NOW you have a truly redundant,
high-availability system to bank your services on. If you've
successfully reached this point, let your imagination do some
exploration and think up new ways to apply this clustering to your
particular application.
I will be the very first to admit that TKCluster has bugs and
limitations. Some are easy to overcome, some will require some more
work. Here is a not-too-comprehensive list:
- Fork/exec/waitpid stuff in storaged.pl is specific to Perl 5.8.0.
I originally wrote this using Perl 5.6.1, and then it broke on 5.8.0.
Perhaps someone can find how to generalize this and "do the right
thing" on a wider variety of Perls.
- Make cluster.conf into a more "normal" config file that is parsed
more normally instead of eval'd, thus removing anyone's security
concerns.
- Bundle up the TKCluster package into something more easily
installed without having to manually cp stuff.
- Test on a wider variety of hardware and distributions, and
construct an install script or RPM/DEB which will install it properly
into those distributions as needed.
- Allow for multiple NICs in the private cluster so that if one
gigabit interface fails, it allows for falling over to a standby
interface. Right now, all communication goes over the single gigabit
card. A dedicated 10/100 or a pair of dedicated 10/100's could be
used for storaged.pl to talk over instead of the gigabit.
- A web interface would be nice. It would be nice to have a third
node, not necessarily a cluster member, which could be used to
administrate either node of the cluster and their operation remotely.
Since everything is Perl and most functionality lives in the Perl
modules, building a web admin app out of it shouldn't be rocket
science. This would make it more convenient for those deploying this
into a rack in a server room to administrate from their desktop or a
dedicated node in the rack.
- Cleaning up of the code. I admit my Perl is somewhat sloppy in
places. It should be beautified and cleaned up.
- Better/more documentation and comments within the code.
- There are probably more bugs and limitations.
Clusters used for high-performance computing and data-replication used
to be high-margin items available only from large vendors. However,
with the freedom of Linux we have the opportunity to build these
clusters ourselves now. TKCluster is one attempt at merging both HPC
and HA clusters into a single system that is free, easily installed,
and runs on commodity PC hardware.
I welcome your comments and suggestions. If you think this might be
useful, feel free to use it. If you have some ideas and code you
think would improve it, go right ahead. Please send me your
thoughts, suggestions, code, diffs, or anything else that could
improve it.
Tom Kunz lives with his amazing wife and children in the Pocono region
of northeastern Pennsylvania. Tom has recently opened a web/mail
hosting and custom software business, and as always, is looking for
customers. His primary goal right now is to build his business to the
point where he can adopt an entire family of orphaned children from
Russia (international adoption is expensive!). His website is
http://www.SolidRockTechnologies.com/ and he can be
reached via email at tkunz - at - SolidRockTechnologies.com