FRAMINGHAM (01/15/2004) - Traveling today has been transformed by online resources such as MapQuest, Travelocity, and Orbitz that provide directions and information. In many ways, accessing and using the explosion of informatics data requires similar resources. Researchers need to know the location of a database, then track down details (perhaps from the experimenter or a lab notebook) about how the data were collected, what format they are in, and so on.
This dual process is manageable when confined to just a few data sources. But matters get complicated when multiple sources need to be taken into account. Given that researchers often must evaluate many different types of data, such complications can seriously impede research efforts.
The vast array of informatics data available today makes it difficult to automate data access and sharing, which in turn makes it difficult to set up production-quality workflows in the research environment.
The Interoperable Informatics Infrastructure Consortium (I3C) is trying to tackle such issues with a new naming standard and data access protocol called the Life Science Identifier (LSID). At its core, LSID provides a uniform way to name and locate specific pieces of informatics data over the Internet.
The I3C, whose members come from life science companies, academic labs, and vendors such as IBM and Oracle, hopes that LSID will help enable interoperability between informatics applications. "The consistency of LSID makes it valuable to the integration of data resources," says Andy Ellicott, executive director of the I3C and formerly a program manager at Infinity Pharmaceuticals.
Others in the industry echo the potential value of LSID. "With a common standard for data retrieval, scientists across organizations may easily share data, facilitating collaboration on projects such as drug discovery and disease research," says Ben Szekely, a software engineer at IBM.
Directions to the Data
Work on LSID began in early 2003 when the I3C, in conjunction with vendor members including Sun Microsystems Inc. and IBM Corp., developed a specification for naming data resources. Similar to a URL, LSID uses a uniform resource name (URN) to locate data. The URN contains five parameters that uniquely identify the data of interest.
An example of a URN is: urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2.
Taking each element in turn, urn:lsid is a mandatory preface for LSID data; ncbi.nlm.nih.gov is the Internet domain of the organization that assigned an LSID to the data; GenBank is the name of the data resource; and the last two parameters are the name and version number of the specific data element.
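To make the breakdown concrete, here is a minimal Python sketch that splits an LSID into the parts just described. The names Lsid and parse_lsid are hypothetical and chosen only for illustration; they are not part of any I3C or IBM software. The sketch assumes the fields are separated by colons and, as an assumption, treats the trailing version number as optional.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str            # Internet domain of the assigning organization
    namespace: str            # name of the data resource
    object_id: str            # name of the specific data element
    revision: Optional[str]   # version number, or None when absent


def parse_lsid(urn: str) -> Lsid:
    """Split an LSID URN into its colon-separated parts (illustrative only)."""
    parts = urn.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields, got {len(parts)}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("an LSID must start with 'urn:lsid:'")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError("authority, namespace, and object fields must be non-empty")
    # Assumption: a missing or empty sixth field means "no version given".
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(authority, namespace, object_id, revision)


genbank = parse_lsid("urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2")
print(genbank.authority, genbank.namespace, genbank.object_id, genbank.revision)
```

A client application could use the authority field from such a parse as the starting point for locating the data.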
Taking advantage of the URN in an informatics application requires two pieces of software: a client piece within the informatics application, and a server piece associated with the actual data.
The LSID server software sits in front of an informatics database. The LSID "is a layer above your database, so you don't need to modify the data themselves," says Sean Martin, a senior technical staff member at IBM.
This is a key point about using LSID. "There are lots of life science databases, so you can't expect people to throw away or change existing schemas in how they get their data," Martin says. With LSID, an organization continues to use its normal routines for generating and storing informatics data. The only thing that changes when using LSID is that there is an alternative access route.
Once database owners have LSID server software in place, they can advertise the fact by making the information available to the informatics community. The information would be sent to a so-called LSID authority, which contains a list of all the available LSID data resources.
An LSID authority could be set up within an organization and used as a repository for all internal data sources. Similarly, industry organizations such as the I3C or other groups could establish their own LSID authority.
An informatics application that wants to access these data needs client LSID software. This software queries the Internet's Domain Name System (DNS) to find the network location of the appropriate LSID authority.
Once the LSID authority for the specific data is known, the informatics application makes a request to that authority, which returns a document that includes the location of the data and metadata that gives practical information (data-generation procedures, format, etc.). The information in this document is then used by an informatics application to retrieve the data.
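The lookup sequence described above (find the authority, ask it for a descriptive document, then fetch the data) can be sketched in Python as follows. This is a simulation under stated assumptions, not the actual LSID client: the three dictionaries stand in for DNS, the authority service, and the data server, and every host name, record layout, and function name is hypothetical. A real client would make network calls at each step.

```python
from typing import Tuple

# Stand-in for DNS: authority domain -> network location of its LSID authority.
# (Hypothetical host name.)
DNS_TABLE = {"ncbi.nlm.nih.gov": "lsid-authority.example.org"}

# Stand-in for the authority service: for each LSID, a document giving the
# location of the data plus metadata. (Hypothetical layout and values.)
AUTHORITY_DOCS = {
    "lsid-authority.example.org": {
        "urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2": {
            "data_location": "genbank-store.example.org/T48601.2",
            "metadata": {"format": "placeholder format"},
        }
    }
}

# Stand-in for the server that actually holds the data.
DATA_STORE = {"genbank-store.example.org/T48601.2": b"placeholder record bytes"}


def find_authority(lsid: str) -> str:
    """Step 1: use the domain inside the LSID to locate its authority."""
    domain = lsid.split(":")[2]
    return DNS_TABLE[domain]


def request_document(authority_host: str, lsid: str) -> dict:
    """Step 2: ask the authority for the document describing this LSID."""
    return AUTHORITY_DOCS[authority_host][lsid]


def retrieve(lsid: str) -> Tuple[bytes, dict]:
    """Step 3: follow the document's location to fetch the data itself."""
    doc = request_document(find_authority(lsid), lsid)
    return DATA_STORE[doc["data_location"]], doc["metadata"]


data, metadata = retrieve("urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2")
print(len(data), metadata)
```

The point of the sketch is the indirection: the application never hard-codes where the data live, only the LSID.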
If the data consist of a single Web page, this entire process simply maps the URN to a URL and retrieves the page. However, most data cannot be represented by a URL: for instance, if an investigator wants to access information about a protein in the Protein Data Bank, a simple URL will not suffice.
Much genomic research routinely relies on a standard framework -- the Internet, a browser, Perl scripts, and a number of bioinformatics databases and algorithms. But as researchers use more disparate investigative technologies and databases, there is an increasing need to identify data more quickly, provide access, and enable greater interoperability between bioinformatics applications. LSID addresses many of these issues.
Vendor Adoption Is Key
During 2003, LSID moved from a concept to a specification. And IBM developed open-source LSID software that some developers are already using and evaluating. But a true test of LSID's acceptance will be the interest I3C can muster from the informatics vendor community. The early signs are encouraging.
"We've just started talking about using LSID in our products," says Robin Smith, founder and chief science officer of chemical informatics tool vendor Synthematix. "Conceptually, I like the open, collaborative nature of LSID -- it seems like a natural fit for us."
"(With our products), we chose Java, open-source coding technology, XML, and standards," Smith says. "This allows us to code very quickly, make changes quickly, and import data quickly," -- all of which validate LSID.
"Collaboration isn't well done in organizations; there are lots of silos," Smith says. "But people are evolving to more open standards -- not just for their infrastructure but for their applications, too." That is why LSID appeals to a company like Synthematix.
IBM and Optive Research are also exploring the use of LSID within their products. Several other vendors indicate that although they are not yet actively looking at LSID, they are keeping an eye on developments.
Some vendors have invested a lot of resources to develop data access technologies of their own. Normally, something like LSID would be of little interest to these vendors because the proprietary data access technologies in their products are a huge selling point. But Ellicott thinks LSID might have broader appeal. "Building in LSID support can be as simple as adding a menu item in an existing application," he says. "It's much easier to add in access to more data sources."
For example, when an e-notebook or informatics analysis application vendor wants to support a new database, it would typically partner with the data supplier. When a new experimental tool comes along -- for example, a protein chip -- the chip-building vendor determines the data format and the method of storage in a database. If the vendor chooses to use a proprietary database to store the data, the informatics application developer must write software that taps into an associated application programming interface (API). Each new experimental technique requires a new API.
But if the chip-building vendor adopts LSID, the informatics application developer need only use LSID's URN within its program to locate and access the data. "LSID would save vendors a lot of development expense," Ellicott says. "It would also (ensure) faster delivery of new informatics tools."
If that isn't enough incentive for vendors that have invested heavily in data integration up to now, Ellicott notes that "even if they've partnered, partners evolve." For instance, many life science equipment companies change their APIs as new gear is rolled out.
Obstacles to Adoption
Conceptually, adding LSID support to existing databases and informatics applications seems like a no-brainer. But as with any proposed standard, obstacles lie along the way.
One area of resistance could be from industry groups that have already invested significant development time in data sharing. For example, efforts such as the Distributed Annotation System (DAS), BioMOBY, and OmniGene have provided users in these communities with tools and common, easy-to-use methods for accessing relevant data.
But Ellicott doesn't see LSID as usurping any of the work these groups have done. "These efforts are very domain-specific," he says. "LSID is above this, being discovery-independent and discipline-independent." These organizations can keep their existing methods of sharing data, but LSID could also make those data available to other researchers outside the scientific domains encompassed by these groups.
A second potential obstacle to widespread adoption of LSID is user comfort level with the technology. In the past, many IT industry efforts started out with beneficial goals but quickly turned into self-serving, vendor-driven initiatives.
To circumvent this, I3C is making LSID a formal standard. "Within the I3C we have developed a specification for LSID," IBM's Martin says. "The standard will come from the Object Management Group."
Perhaps the most serious issue with LSID revolves around security. The issues here are not specific to LSID, but pertinent to any effort to make data available through the Web. A key benefit of LSID is that providers that make their data available as a Web service can simplify the way data are accessed by informatics applications.
But Web services adoption in general is being slowed because of Internet security problems. "What's lacking in many applications today is security, access control, and control mechanisms," says Bernie Wess, president of Perseid Software. "What's needed is next-generation security that includes better authentication and nonrepudiation."
While waiting for improved Web security, those wishing to use LSID do have ways to safely share data today. For instance, most companies have no problem using the Internet to access PubMed, GenBank, Swiss-Prot, and other public databases.
If a company wants to share confidential or proprietary data using LSID, it could set up an internal LSID authority that sits behind the corporate firewall. In this way, a URN lookup would direct the informatics application to an internal server that supports the in-house database. Outside users would not have access to this LSID authority.
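As a rough illustration of that arrangement, the Python sketch below routes a lookup to an internal authority only when the request originates inside the firewall. The table, the host names, the sample LSID, and the function authority_for are all hypothetical; a real deployment would enforce this through network configuration rather than an application flag.

```python
from typing import Optional

# Hypothetical table kept inside the firewall:
# authority domain -> host of the internal LSID authority.
INTERNAL_AUTHORITIES = {"research.example.com": "lsid.internal.example.com"}


def authority_for(lsid: str, inside_firewall: bool) -> Optional[str]:
    """Return the internal authority host for an LSID, or None if no internal route applies."""
    domain = lsid.split(":")[2]
    if inside_firewall and domain in INTERNAL_AUTHORITIES:
        return INTERNAL_AUTHORITIES[domain]
    # Outside callers, and domains not listed internally, get no internal route.
    return None


in_house = "urn:lsid:research.example.com:assays:A123:1"
print(authority_for(in_house, inside_firewall=True))
print(authority_for(in_house, inside_firewall=False))
```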
LSID represents an alternative way to access informatics data, but it does not obviate existing work in this area. Nevertheless, it is an alternative that some believe is greatly needed. A final key challenge to LSID adoption is industry awareness. To that end, Ellicott is acting as an emissary. For those interested in learning more about LSID, go to MapQuest and type in "900 Boylston Street, Boston." That's the location of the Hynes Convention Center, where the Bio·IT World Conference + Expo will be hosting an LSID workshop.
At the heart of the LSID is a uniform resource name (URN) that specifies a common way to access life science data through the Web by uniquely identifying data resources. If widely adopted, URNs would help enable interoperability between various informatics applications. A URN contains as many as five elements, including:
LSID Designator: A mandatory preface that notes that the item being identified is a life science-specific resource
Authority Identifier: An Internet domain owned by the organization that assigns an LSID to a resource
Namespace Identifier: The name of the resource (e.g., a database) chosen by the assigning organization
Object Identifier: The unique name of an item (e.g., a gene name or a publication tracking number) as defined within the context of a given database
Revision Identifier: An optional parameter to keep track of different versions of the same item
A PubMed article: urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434
The first version of the 1AFT protein in the Protein Data Bank: urn:lsid:pdb.org:1AFT:1
The second version of an entry in GenBank: urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2
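Going the other direction, an organization assigning identifiers could assemble an LSID from the elements listed above. The following Python sketch shows one way to do so, appending the revision only when one is supplied. The function name format_lsid and its validation rules are assumptions made for illustration, not part of the specification.

```python
from typing import Optional


def format_lsid(authority: str, namespace: str, object_id: str,
                revision: Optional[str] = None) -> str:
    """Join the listed elements into a URN; the revision is added only if given."""
    fields = ["urn", "lsid", authority, namespace, object_id]
    for field in fields[2:]:
        # Assumption: elements must be non-empty and must not contain the separator.
        if not field or ":" in field:
            raise ValueError(f"invalid LSID element: {field!r}")
    if revision is not None:
        fields.append(revision)
    return ":".join(fields)


print(format_lsid("ncbi.nlm.nih.gov", "pubmed", "12571434"))
print(format_lsid("ncbi.nlm.nih.gov", "GenBank", "T48601", "2"))
```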
On March 31, the I3C and Bio·IT World are teaming up to offer informatics and IT professionals a daylong workshop on LSID technology. The workshop will take place as part of the Bio·IT World Conference + Expo in Boston.
Nearly a dozen presenters, including I3C members, vendors, and life scientists, will cover:
· What LSID is, how the specification was created, why, and by whom
· How it will change the informatics landscape
· How to develop and integrate LSID-enabled resources
· How software vendors and bio/pharmas are adopting and benefiting from LSID
· The future of LSID technology and how to participate in its evolution
See www.bio-itworldexpo.com for more information about the workshop.