At first glance, Farecast.com’s claim that its website can predict with 75% accuracy whether a particular airfare is going to rise or fall in the next seven days doesn’t sound that impressive. Isn’t flipping a coin accurate 50% of the time?
The Seattle firm uses a finely tuned data-mining engine to analyse more than 150 billion airfare price quotes from the past 18 months, to and from 75 major US cities, to come up with its prediction. And that extra 25 percentage points of certainty on fares apparently really matters: more than a million unique would-be fliers have tried the free Farecast.com service since August.
“It’s a very complex problem,” says Jay Bartot, vice-president of technology at Farecast.com.
“Our data-mining engine is very large and sophisticated. We do a lot of post-processing, deriving new data from our existing data, which is then fed into our predictive engine.”
In other words, Farecast.com is constantly generating airfare predictions on its own, in addition to those requested by consumers. It then checks its results against the actual price quotes generated by the airlines, allowing it to figure out how accurate it really is and further fine-tune its data-mining operation.
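The self-checking loop described above can be illustrated with a minimal sketch. This is not Farecast.com's code: the `Prediction` record, the `actual_direction` and `accuracy` helpers, and the rule that an unchanged fare counts as a "fall" are all assumptions made for illustration. It simply scores earlier rise-or-fall calls against the prices observed later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Hypothetical record of one rise/fall call made for a route."""
    route: str
    predicted_direction: str  # "rise" or "fall"
    price_at_prediction: float


def actual_direction(price_then: float, price_later: float) -> str:
    # Assumption: an unchanged fare is counted as "fall" (i.e. it did not rise).
    return "rise" if price_later > price_then else "fall"


def accuracy(predictions, later_prices) -> float:
    """Fraction of predictions whose direction matched the later observed price.

    `later_prices` maps a route to the price quoted after the prediction window.
    Predictions with no later quote are skipped rather than counted as misses.
    """
    scored = [p for p in predictions if p.route in later_prices]
    if not scored:
        return 0.0
    hits = sum(
        1
        for p in scored
        if actual_direction(p.price_at_prediction, later_prices[p.route])
        == p.predicted_direction
    )
    return hits / len(scored)
```

Feeding the function four predictions of which three turn out correct would yield 0.75, the figure the company quotes; a coin flip would be expected to hover around 0.5 on the same data.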
Started in 2003, Farecast.com was spun out of research by professors at the University of Washington and the University of Southern California, yielding what is essentially a business intelligence service for consumers. The choice of which database technology to use was key, and the company experimented with several open-source databases, including PostgreSQL and Berkeley DB, before initially settling on MySQL.
Even so, as Farecast.com neared its launch date, Bartot worried. “We knew we would have to scale out in a major way. I had read some stories about companies doing huge rollouts of MySQL clusters, but in at least one case relevant to us it turned out to be more of an experiment,” Bartot says.
Having had experience with Oracle at previous jobs, Bartot decided to move off MySQL to an Oracle 10g-based grid.
Farecast.com now runs a four-node cluster using Real Application Clusters, Oracle Partitioning and Enterprise Manager 10g. Each node is running SUSE Linux Enterprise Server, with two dual-core AMD Opteron 275HE processors, 8GB of DDR 3300 memory and remote NFS-attached storage.
Farecast.com dedicates one node to ad hoc queries. Another node handles administrative tasks, while a third node handles the key task of loading data. With information-sharing agreements from all of the major domestic carriers, except for Southwest Airlines, Farecast.com adds more than three billion airfare quotes each month. That data arrives around the clock, in XML format, via a provider called ITA Software, before being transformed into SQL and other formats by Farecast.com’s in-house tools. It is then loaded into the Oracle data warehouse.
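The XML-to-database step can be sketched roughly as follows. Everything specific here is hypothetical: the `<quote>` element and its attributes, the `fare_quotes` table, and the `parse_quotes` and `load_quotes` helpers are invented for illustration and do not reflect ITA Software's feed or Farecast.com's in-house tools. Python's built-in SQLite module also stands in for the Oracle warehouse so that the example is self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical feed format; the real XML schema is not described in the article.
SAMPLE_XML = """<quotes>
  <quote origin="SEA" destination="BOS" depart="2007-03-01" price="318.00"/>
  <quote origin="SEA" destination="JFK" depart="2007-03-02" price="289.50"/>
</quotes>"""


def parse_quotes(xml_text):
    """Turn each <quote> element into a (origin, destination, date, price) row."""
    root = ET.fromstring(xml_text)
    return [
        (q.get("origin"), q.get("destination"), q.get("depart"), float(q.get("price")))
        for q in root.iter("quote")
    ]


def load_quotes(conn, rows):
    """Bulk-insert the rows and return the total row count in the table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fare_quotes "
        "(origin TEXT, destination TEXT, depart_date TEXT, price REAL)"
    )
    # Parameterised inserts avoid building SQL strings by hand.
    conn.executemany("INSERT INTO fare_quotes VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM fare_quotes").fetchone()[0]
```

Calling `load_quotes(sqlite3.connect(":memory:"), parse_quotes(SAMPLE_XML))` loads the two sample quotes into an in-memory table. At three billion quotes a month, a production loader would presumably batch and partition the work rather than parse whole documents in memory.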
Bartot calls the Oracle technology “robust” and “a great product” that is easy enough for just two system administrators to handle.
There are rare occasions when usage has spiked enough to cause the Oracle database to lock up, but, Bartot says, “They are easily things at the application layer we can change to fix it. Nothing I would characterise as a shortcoming in Oracle.”
The 5TB data warehouse, which is compressed to 1TB to fit on disk, is growing fast. According to Bartot, the company is adding more data sources, such as airfares to and from non-US cities, in preparation for a likely launch of international price predictions late in 2007.