No room for server failure


Sometimes, you simply can't afford a server failure. Maybe it's a system that monitors safety conditions at a factory, or an application that records financial transactions. Maybe it's a bottleneck back-end database server that supports a distributed Web farm.

Stratus Technologies Bermuda's ftServer 6600 fault-tolerant server is expressly designed for these grave situations.

Nearly everything is redundant on the ftServer 6600, including the processors. That degree of high availability comes via a modular approach. The server consists of a 10U-high (17.5 inches) rack-mountable chassis that includes seven slide-out pizza-box modules: two for redundant hard disks, two for redundant PCI I/O cards, and two or three for redundant motherboards and processors.

All of these field-replaceable modules are cross-connected using a passive backplane, and each contains its own AC power supply and cooling system.

Everything on the server runs in lockstep, even the Intel Xeon processors and the memory chips within the processor modules. If one module of a pair fails, the other instantly picks up the workload without dropping a transaction.

This approach isn't new. Many proprietary servers, including those from Tandem (now owned by Hewlett-Packard), offer similar lockstep-based protection. Stratus has sold a line of fault-tolerant servers for some time.

But what sets Stratus' approach apart is that it uses Intel Xeon processors and runs Windows 2000 Advanced Server or Windows Server 2003. Compared with Stratus' older ftServer 6500, the new 6600 model packs faster processors, disks, and I/O into about half the space.

Available this month at under $100,000, this truly is fault-tolerant hardware for the masses, but it's more expensive than setting up clusters of lesser servers.

The ftServer 6600 boots up using only a single set of processor, disk, and I/O modules. After the operating system starts, it loads a set of drivers and utilities that brings the redundant hardware online. This software is designed to keep each module operating in lockstep with its partner.

One module of each type is always set as the active component while its partner monitors the active component's behavior using watchdog circuits and backplane cross-connects. The partner automatically takes over if it detects a failure in the active component. It also sends out alerts via SNMP, email, and modem when a component fails.

At least that's what it should do. The test system didn't quite meet expectations.

I pulled the disk modules one at a time and replaced them easily. The server did not hiccup, and tasks continued to run without delay. When the disk slice module was replaced, LEDs indicated that the module was powering up and synchronizing; within a few minutes, the status lights showed that full fault tolerance was restored.

There was similar fault-tolerant behavior with the processor modules; the server worked perfectly with either module installed.

But when I tested the third module type, the I/O slice that contains the PCI slots, the server crashed a few seconds after the module was reinserted. Simulating the problem with a software-based shutdown and restart of the I/O module through the administrative software also caused a crash, so faulty connectors or wiring were not to blame.

After much investigation, engineers at Stratus determined that this was a known issue with that generation of preproduction server, caused by electrical signaling problems on the server's backplane. According to the company, the fault was discovered and repaired before customer beta hardware was released.

That problem aside, the server is impressive. Stratus clearly put effort not only into building the fault-tolerant hardware and software drivers, but also into making the server manageable via a Microsoft Management Console snap-in or a browser. Each processor module contains a modem designed to phone back to Stratus' datacentre if faults are discovered in the server, and to allow Stratus' engineers to call in to diagnose the server. That remote access let the company gather data to debug the crash problem during my tests.

The only fault with this extremely fault-tolerant server is the aforementioned electrical signaling problem. But assuming the company has fixed that blip, the ftServer 6600 is highly recommended for Windows applications that need this extreme level of hardware high availability.

Stratus shipped me an early preproduction server, and the company's engineers worked on-site to assemble it like a Lego toy, fitting the two sets of processor, disk, and I/O modules into the machine's seven-slot chassis. The end result: a four-way 2.8GHz Xeon MP server with 4GB RAM, two Ultra160 SCSI hard drives, and two PCI expansion slots.

Actually, multiply that by two. All the hardware was doubled for fault tolerance, so the test server had eight processors, 8GB RAM, four hard drives, and four PCI slots, which must be populated in identical pairs. Note that the company has qualified only a small list of cards, mainly Fibre Channel and SCSI adapters, for installation in this server. The test system had two 18GB boot drives and two 36GB data drives; the company didn't provide any PCI cards for the I/O modules. The two halves of the server share a common set of USB and video adapters.

Next, Stratus installed Windows 2000 Advanced Server (Service Pack 3). The company insisted that its engineers perform this process because they replace some of the Windows drivers with their own "hardened" code and add other software to manage the synchronization and failover of the lockstep devices across the backplane, replicate memory between the two processor modules, and monitor and manage the devices. The company also offers Windows Server 2003, but I chose to stick with Windows 2000 simply because of my greater familiarity with the older code base.

Testing the ftServer 6600's fault-tolerance features consisted primarily of running continuous tasks on the server while unplugging, failing, and then re-enabling its hardware components. Those tasks included operations that repeatedly wrote the system clock time into a log file, so that any disruptions in operation could be easily seen, as well as external health monitoring via Rational's SiteLoad load-testing software.
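
The timestamp-logging task described above is easy to approximate. What follows is a minimal Python sketch of the idea, not the script used in this review: one function appends the current time to a log file at a fixed interval, and another scans the recorded times for spacing that is longer than expected. The function names, the one-second interval, and the half-second tolerance are all assumptions made for illustration.

```python
import time
from datetime import datetime, timedelta


def write_heartbeats(path, count, interval_s=1.0):
    """Append `count` ISO-format timestamps to `path`, one per interval."""
    with open(path, "a", encoding="utf-8") as log:
        for _ in range(count):
            log.write(datetime.now().isoformat() + "\n")
            log.flush()  # hand each line to the OS right away
            time.sleep(interval_s)


def read_timestamps(path):
    """Parse the log file back into a list of datetime objects."""
    with open(path, encoding="utf-8") as log:
        return [datetime.fromisoformat(line.strip()) for line in log if line.strip()]


def find_gaps(timestamps, interval_s=1.0, tolerance_s=0.5):
    """Return (previous, current) pairs spaced further apart than expected."""
    limit = timedelta(seconds=interval_s + tolerance_s)
    return [(a, b) for a, b in zip(timestamps, timestamps[1:]) if b - a > limit]
```

In a test of this kind, `write_heartbeats` would run on the server while modules are pulled and reinserted; afterward, `find_gaps(read_timestamps(path))` lists any stretch during which the logger stalled. An empty list suggests the workload never paused for longer than the chosen tolerance.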
