Feature: Facing the online endurance marathon

Maintaining 24-hour service online is a massive effort

Maintaining web-based businesses 24x7 is an Olympian effort that requires a mix of technology, systems and investment in redundancy that is far from simple.

According to a recent Intel case study, Chinese e-commerce site Taobao.com says it all comes down to resilient and powerful Intel microprocessors and architecture — coupled with virtualisation.

Trade Me, New Zealand’s equivalent of Taobao.com, says there is more to it than that and points to the redundancy built into its architecture, a theme that emerges with other major websites as well.

Trade Me’s head of technology, David Wasley, stresses the importance of redundancy, saying Trade Me is run from two sites, Auckland and Wellington. There is no primary or secondary site and everything can be run from one site or the other, he explains.

Trade Me says it doesn’t operate on such a scale that virtualisation would offer much benefit, but, with 100 servers in total, it has the capacity to take one site down for maintenance while the other takes up the load.
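
The reasoning behind a two-site setup with no primary or secondary can be checked with simple arithmetic: either site must be able to carry the peak load alone. The sketch below is illustrative only. The even split of the 100 servers, the per-server capacity and the peak-load figures are all hypothetical assumptions, not Trade Me's numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    name: str
    servers: int
    requests_per_server: int  # hypothetical sustained requests/second per server


def capacity(site: Site) -> int:
    """Total requests per second one site can serve on its own."""
    return site.servers * site.requests_per_server


def survives_site_loss(sites: List[Site], peak_load: int) -> bool:
    """True if the peak load still fits after losing any single site."""
    total = sum(capacity(s) for s in sites)
    return all(total - capacity(s) >= peak_load for s in sites)


# Assumed figures for illustration: 100 servers split evenly across two sites.
sites = [Site("Auckland", 50, 200), Site("Wellington", 50, 200)]
print(survives_site_loss(sites, peak_load=8000))   # one site can carry this peak
print(survives_site_loss(sites, peak_load=12000))  # this peak needs both sites
```

If the check fails for the expected peak, taking a site down for daytime maintenance would no longer be safe without adding servers.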

Sleeping easily

Being able to take one site offline also allows staff to undertake maintenance work during the day, rather than in the middle of the night. Of course, there are other safeguards too, such as backup and restore, and security.

“All our data is replicated between each site, so if we lost Auckland, we have everything in Wellington,” Wasley explains.

“We also run a lot of security systems, from standard anti-virus to a comprehensive anti-intrusion detection system from IBM, with firewall and all the standard security layers. The IBM intrusion-detection system gives us very good visibility of all the traffic going through the network.”

Earlier this month, there was a major SQL attack, which Trade Me was able to monitor and protect itself from by using the system and pulling up data on a management console.

Wasley stresses that “commodity” servers are used, with no preference for AMD or Intel. The important thing is that Trade Me can replace them quickly when needed, or scale up and ramp up processing.

“We put a lot of emphasis on redundancy because we want our guys to sleep,” he continues.

“We allow three to four things that can fail and still be able to operate. We are really good with our alerting and monitoring, and error reports are available to all. At any one time, we have three people on call.” And reliability is enhanced by having the work and the “smart people” in-house.

Trade Me represents a special challenge because its huge traffic levels mean that if there is a failure it can suddenly spread.

“We can have 70,000 people online and 10,000 auctions closing within an hour. This means we have to be on top of things. We recently had a brief failure because one of our databases was not being nice to us. We are still looking at the underlying cause. We had a CPU spike which resulted in our network traffic dropping for a four-minute period,” Wasley explains.

However, he reports that the Trade Me site has been down for only “single digit minutes” during the past year.

Managing failure

Issues of redundancy and failure were also highlighted by Air New Zealand.

General manager of group IT production Ed Overy says the airline maintains a high availability infrastructure and application suite that manages individual component failure without compromising the website. It also has the flexibility to promote managed changes to applications and infrastructure without disrupting service.

“Generally, each component has at least one hot fail-over and we also have the entire platform distributed over our primary and secondary datacentres, to cover for disaster recovery,” Overy explains. The airline also operates in an environment that caters for the extraordinary peaks in load generated by customer demand in response to promotional campaigns.

“In these situations, the web technology caters to over 20 times the ordinary traffic. Economically, building and maintaining systems that cater to these peaks is challenging,” he says.

“We cater for the large capacity demands by sizing some components to deal with the load (firewall) or the smart use of content routing and caching.”
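
Caching is one way a site can absorb a promotional spike without sizing every back-end component for 20 times normal traffic. The following is a minimal sketch of a time-to-live cache, not a description of Air New Zealand's system. The class name, the 30-second lifetime and the example URL are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Serve repeated requests from memory for a short time-to-live."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch          # slow call to the back-end system
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._store: Dict[str, Tuple[float, str]] = {}
        self.backend_calls = 0       # how often the back end was actually hit

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh: no back-end work needed
        self.backend_calls += 1
        value = self._fetch(key)
        self._store[key] = (now, value)
        return value


# Hypothetical usage: many requests for the same page within 30 seconds.
cache = TTLCache(lambda url: "rendered " + url, ttl_seconds=30)
for _ in range(20):
    cache.get("/special-fares")
print(cache.backend_calls)
```

With a cache like this in front of it, the back end sees one request per page per lifetime window, however many visitors arrive in that window.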

Air New Zealand’s web infrastructure is a J2EE tiered architecture, including internal and external firewall management, load balancing, content management, application servers, database servers, middleware services and mainframe.

Technology employed includes Check Point firewall, IBM WebSphere application server, Message Broker, Oracle Database, Solaris and Red Hat Linux, and ALCS and z/OS on the IBM mainframe.

Applications are largely developed in-house, with infrastructure support outsourced. Gen-i, IBM and OSS are the key suppliers for web platforms.

Overy says Air New Zealand enjoys some benefits of virtualisation, particularly in the middleware services platform using VMware, where 14 servers are virtualised across two physical servers. But virtualisation is used more as a way of provisioning servers than as a way of providing greater resilience or availability.

Air New Zealand also has its systems managed by a 24x7 on-call support team comprising both its staff and that of its suppliers.

Multiple back-up

Likewise, Telecom New Zealand maintains 24x7 reliability for its own website, and that of its Ferrit e-commerce offshoot, by using technology such as Layer 4 switching and load distribution, multiple redundant nodes, plus internet connections and applications that allow rolling software changes without interrupting service. The telco aims to have its websites running 24x7, but not all systems need to run simultaneously to do this.
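
A rolling change across redundant nodes can be sketched roughly as follows: take one node out of service at a time, change it, and stop if anything fails or if too few nodes would remain. This is a generic illustration, not Telecom's procedure. The function and node names and the `min_in_service` threshold are hypothetical.

```python
from typing import Callable, List


def rolling_upgrade(nodes: List[str], upgrade: Callable[[str], bool],
                    min_in_service: int) -> List[str]:
    """Upgrade nodes one at a time and return those upgraded successfully."""
    in_service = set(nodes)
    upgraded: List[str] = []
    for node in nodes:
        if len(in_service) - 1 < min_in_service:
            raise RuntimeError("not enough redundant nodes to upgrade safely")
        in_service.remove(node)      # stop sending traffic to this node
        if not upgrade(node):
            break                    # leave the failed node out and halt the rollout
        in_service.add(node)         # node is healthy again: return it to service
        upgraded.append(node)
    return upgraded


# Hypothetical usage: three web nodes, at least two must stay in service.
print(rolling_upgrade(["web1", "web2", "web3"], lambda n: True, min_in_service=2))
```

Halting on the first failure means a bad release affects one node only, while the remaining nodes keep serving the previous version.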

“We run a 24x7 second-level network operations centre which hands over to the third level of support if required. This isn’t required for all applications of course, just those that have a high financial or customer impact if they are not available. We aim to maintain three consecutive backups online, so we don’t need to ship tapes in the event of a failure requiring restore from back-up,” says spokeswoman Michela Enna.

The processing power of servers, a robust IT infrastructure and having reliable IT partners also come into play. However, Telecom believes a fundamental requirement is to understand your system, its requirements and the current levels of utilisation across the entire platform, including networks, and to make sure you are aware of bottlenecks before they become issues.

Maintaining 24/7 services is always a challenge, says the company.

“A common issue that many technology teams are facing, across many businesses, is managing the volume and magnitude of change through the systems. The business drivers for change are accelerating and we are constantly reviewing our control and change management practices to ensure we can overcome challenges as/if they arise,” Enna adds.

Effective protocols

Systems integrator Gen-i highlights further challenges.

Emerging technology strategist Karen Monks says customers’ demands are rising, and online businesses face greater pressures than bricks-and-mortar stores because they cannot close down for five minutes. The world has also moved on from simple static pages, where completing an order involved interaction with a real person, to ordering by credit card, with customers’ details not only processed but retained for next time, and sites recommending similar products they might like.

Monks says most major e-commerce sites regard server virtualisation, redundant hardware and very quick failover as standard. The stability and flexibility of modern architecture also ensures these businesses can run 24/7.

“However, what really makes a difference is putting in place effective protocols for managing unplanned outages. We help clients to train staff so they understand a system holistically, not just their specialist area. This can avoid a quick outage turning into a major event,” she says.

New Zealand is not lagging behind in the technology stakes, but sites can have problems from growing piecemeal and trying to accommodate a growing load.

Security also matters: not just denial-of-service attacks, but also hacking of the domain name itself and theft of credit card details. Websites may also use “region restrictions”, such as not accepting credit cards from areas where credit card fraud is rife.

This has led to many sites turning to specialist third-party sources to store credit card details and boost safety.

While the underpinning hardware and software are critical to the operation of a site, Monks says it is the quality of the customer relations and stock management that people notice.

“Experiencing a few seconds of lag, while pages load, is more acceptable than sending the wrong product or not providing support when something goes wrong. Customers will judge an online business on the basis of the whole experience, not just at the point at which they enter their credit card details,” she says.

SaaS challenge

The concept of communities is growing too, with websites accommodating technologies that allow this — such as publishing feedback — and sites are also determining what comments are allowable. Such issues need to be considered.

So, what if you are on both the demand and supply sides of the e-commerce fence?

Touchpoint provides clients with an interactive marketing platform, under a SaaS licensing model, which sees it host customers’ websites and databases, and create interactive marketing campaigns that include web, email and mobile. It has several hundred customers across Australasia, and overseas too.

“SaaS applications are harder than just websites, as you need to maintain the software, the databases and numerous messaging components,” says CTO Steve Sherman.

Having this available 24x7 is also mission-critical, particularly as global operations and time differences make downtime harder to schedule. Upgrades of applications thus happen when systems are “live”.

Touchpoint relies predominantly on Linux, Oracle and Java, with specialist software for email and SMS messaging.

“We have evolved sophisticated monitoring of servers and services that provide our infrastructure support team with 24/7 alerts on systems performance and availability. These are unique to our application — not just systems,” he says.

Besides upgrading while “live”, Touchpoint also monitors ISP and telco connectivity problems. Thus, customers may be told of an ISP failure, or another ISP may be used to re-route messages.

Touchpoint uses commodity servers to scale out, not up, typically multicore Dell machines.

Its architecture allows multiple servers to do the same job, and redundancy and failover are built into the platform. This allows for upgrades while “live” and, if there are demand peaks, extra servers can easily take up the peak load, Sherman explains.

Backup and restore is outsourced to the Telstra datacentre, which also manages Touchpoint’s firewalls.

The company has its own specialist staff, but it can rely on other staff from Telstra and Oracle when needed.

“To ensure we have 24/7, we have evolved our monitoring systems to also look at the underlying health of the applications. You can have problems at the application level,” he adds.
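
The distinction Sherman draws, between a server that is up and an application that is actually healthy, can be illustrated with a small check. This is a sketch under stated assumptions, not Touchpoint's monitoring. The metrics chosen (queue depth and error rate) and the thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    host_up: bool        # system-level check, e.g. the server answers at all
    queue_depth: int     # hypothetical application metric: messages waiting
    error_rate: float    # hypothetical application metric: failed requests, 0 to 1


def assess(probe: Probe, max_queue: int = 1000, max_error_rate: float = 0.05) -> str:
    """Classify a service as 'down', 'degraded' or 'healthy'."""
    if not probe.host_up:
        return "down"
    # The host responds, but the application itself may still be struggling.
    if probe.queue_depth > max_queue or probe.error_rate > max_error_rate:
        return "degraded"
    return "healthy"


print(assess(Probe(host_up=True, queue_depth=5000, error_rate=0.01)))
```

A check that looked only at `host_up` would report this service as fine even though its message queue is backing up.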

Virtualisation helps reliability, but Touchpoint claims more progress through designing a distributed application and system, and building virtualisation into this.

“We create a separation between the service and part of the application, and the actual physical hardware it is running on, so the hardware becomes a commodity. That’s a similar approach to what companies like Google would take.”

Indeed, the company also looks to eBay and Google for inspiration in devising technologies to help maintain a 24/7 service.

In addition, within applications Touchpoint also uses dynamic throttling and adaptive applications to allow resources to be redistributed when required.
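
The article does not say how Touchpoint's dynamic throttling works. One common technique is a token bucket whose rate can be changed at run time, sketched below as an assumption rather than a description of Touchpoint's platform. The class name and all rates are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow work at a steady rate that can be adjusted while running."""

    def __init__(self, rate_per_second: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_second
        self.burst = burst
        self._clock = clock              # injectable so behaviour can be tested
        self._tokens = float(burst)
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        earned = (now - self._last) * self.rate
        self._tokens = min(self.burst, self._tokens + earned)
        self._last = now

    def set_rate(self, rate_per_second: float) -> None:
        """Change the rate, e.g. to slow one job when another needs resources."""
        self._refill()                   # settle tokens earned at the old rate
        self.rate = rate_per_second

    def allow(self) -> bool:
        """Return True if one unit of work may proceed now."""
        self._refill()
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


# Hypothetical usage: an email job limited to 2 sends/second, later slowed down.
bucket = TokenBucket(rate_per_second=2, burst=2)
print(bucket.allow(), bucket.allow(), bucket.allow())
bucket.set_rate(0.5)
```

Lowering one job's rate frees capacity for others without stopping it outright, which is one way resources could be redistributed on demand.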

Sidebar: Paymark where 24x7 is ingrained
