FRAMINGHAM (03/10/2004) - The role of storage systems in high-performance computing (HPC) efforts is increasingly under focus. This is particularly true in life science HPC research, where clusters with hundreds or thousands of nodes are used to analyze large data sets.
This is an area where storage vendor Panasas Inc. thinks it can help. Last fall, the company launched the Panasas ActiveScale Storage Cluster, an object-based storage cluster aimed at the HPC Linux cluster environment. Panasas is targeting this system at life science applications.
"We're focusing on production technical computing," says Bruce Moxon, Panasas' chief solutions architect.
Moxon has lots of experience working in HPC environments with large databases. He's been a consultant to life science companies, including Affymetrix Inc., Celera Genomics, Incyte Corp., and Rosetta Inpharmatics Inc. He recently architected, designed, and implemented a high-throughput computational pipeline and analytical data warehouse for Perlegen Sciences' 100+ TB human genome variation (SNP) repository.
He notes that by "production technical computing" he's talking about looking beyond the raw processing capabilities of a cluster. "What's important is the number of jobs you can run in a certain time," says Moxon.
Scope of the Problem
The main issue with storage systems in large clusters is that large volumes of data must move between the physical storage devices and the memory of the individual nodes. This data handling and movement can significantly degrade the performance of many common life science informatics applications (e.g., BLAST, BLAT, FASTA, and HMMer).
For example, the normal approach on large Linux clusters is to stage a data set, such as GenBank, by propagating it to every node. "In some large (thousand-node) clusters, this can take hours," says Moxon.
If a cluster is used for multiple applications, the process likely involves loading the data for one application, unloading it when that application finishes, then loading different data for the next application.
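The staging workflow Moxon describes can be sketched as a parallel copy of one data set to every node's local scratch space. This is an illustrative sketch only: the file names and directory layout are invented, and local directories stand in for the remote nodes a real cluster would push to.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def stage_to_nodes(dataset: Path, node_dirs: list, workers: int = 8) -> None:
    """Copy one data set to every node's scratch area in parallel.

    In a real thousand-node cluster each destination would be a remote
    host (an rsync or scp target); even parallelized, pushing a large
    data set to every node is where the hours go.
    """
    def copy_to(node_dir: Path) -> None:
        node_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(dataset, node_dir / dataset.name)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_to, node_dirs))  # list() surfaces any errors

# Usage: stage a toy GenBank-style file to four simulated nodes.
scratch = Path(tempfile.mkdtemp())
dataset = scratch / "genbank.fasta"
dataset.write_text(">seq1\nACGT\n")
nodes = [scratch / f"node{i:03d}" / "scratch" for i in range(4)]
stage_to_nodes(dataset, nodes)
```

Note that every byte of the data set crosses the network once per node, which is exactly the cost a shared storage cluster is meant to eliminate.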
The nodes in a cluster must themselves manage the input/output (I/O) required by these data movements. A typical sequence analysis job -- a BLAST run, for example -- requires reading a large data set and writing a large result set. Managing this I/O can consume up to 80 percent of a CPU's capacity, leaving only 20 percent for the actual computational tasks.
As a result, the time to complete a job increases dramatically, and the number of jobs performed in a set time period drops accordingly.
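The arithmetic behind that claim is straightforward. The per-job compute time below is an invented illustrative figure, not one from the article; only the 80/20 split comes from the text.

```python
# Illustrative figure: pure compute time per BLAST-style job, in minutes.
compute_minutes = 10.0

# If I/O management consumes 80% of CPU capacity, only 20% is left for
# computation, so wall-clock time stretches by a factor of 1 / 0.20 = 5.
compute_fraction = 0.20
wall_minutes = compute_minutes / compute_fraction   # 50 minutes per job

# Jobs per day, ideal vs. I/O-bound: the same 5x factor appears.
jobs_per_day_ideal = 24 * 60 / compute_minutes      # 144 jobs
jobs_per_day_actual = 24 * 60 / wall_minutes        # 28.8 jobs
```

This is why Moxon measures "production technical computing" in jobs per unit time rather than raw processing power.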
The Panasas ActiveScale Storage Cluster distributes I/O workload across a large number of StorageBlades. These are intelligent disks that offload and parallelize much of the I/O activity, which is a bottleneck in traditional networked file systems. The result is high aggregate performance for large cluster computing environments.
To implement this shared system, Panasas relies on an object-based storage architecture where data files are turned into data objects that include the application data, metadata, and other attributes.
When an application requests data, the client sends a file request to a cluster metadata manager, which authenticates the client and returns the locations of the requested data objects. The client then uses its credentials to read those objects directly from disk, in parallel, across a Gigabit Ethernet switching network.
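The request flow described above can be sketched in miniature. The class and method names below (MetadataManager, StorageBlade, and so on) are illustrative stand-ins, not Panasas' actual API; the sketch only shows the shape of the protocol: the manager hands out layout and credentials checks, while the data itself streams in parallel from the storage devices.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class StorageBlade:
    """Holds object stripes; reads proceed in parallel across blades."""
    objects: dict = field(default_factory=dict)

    def read(self, object_id: str) -> bytes:
        return self.objects[object_id]

@dataclass
class MetadataManager:
    """Authenticates clients and maps a file to its objects' locations."""
    file_map: dict   # filename -> [(blade, object_id), ...]
    tokens: set      # valid client credentials

    def open(self, filename: str, token: str):
        if token not in self.tokens:
            raise PermissionError("unauthenticated client")
        return self.file_map[filename]  # layout only, never the data

def read_file(mgr: MetadataManager, filename: str, token: str) -> bytes:
    """Get the layout from the manager, then fetch stripes direct from blades."""
    layout = mgr.open(filename, token)
    with ThreadPoolExecutor() as pool:
        stripes = pool.map(lambda loc: loc[0].read(loc[1]), layout)
    return b"".join(stripes)

# Usage: one file striped across two blades.
b0 = StorageBlade({"obj-0": b"ACGT"})
b1 = StorageBlade({"obj-1": b"TTGA"})
mgr = MetadataManager({"genbank.fasta": [(b0, "obj-0"), (b1, "obj-1")]},
                      {"client-token"})
data = read_file(mgr, "genbank.fasta", "client-token")
```

The key design point is that the metadata manager is out of the data path: once the client holds the layout, its reads fan out directly to the blades, so aggregate bandwidth scales with the number of blades rather than with one file server.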
When Panasas announced its ActiveScale Storage Cluster last fall, analysts noted that the increased use of high-performance clusters required a new approach to storage.
At about the same time, that sentiment was echoed at the SC2003 conference with the announcement of StorCloud, a new high-performance storage challenge to be held at the SC2004 show. The role of StorCloud will be to demonstrate new high-performance storage technologies for HPC systems.