A new system for locating and clustering highly similar documents on the World Wide Web could help solve a range of the Web's information management problems - from fixing dead links to monitoring intellectual property.
The system has been outlined in 'Syntactic Clustering of the Web', a paper presented by three Digital researchers to the International World Wide Web Conference in California. Their work received the "best paper" award at the conference.
The researchers - Andrei Broder, Steve Glassman and Mark Manasse of Digital's Systems Research Centre - have created several algorithms to describe the similarity and "resemblance" of separate documents available on the Web.
They point out that many documents on the Web are "syntactically similar". That is, they are identical apart from their differing locations (such as FAQ files); or they are the same apart from formatting, customisation, contact addresses, updates, links or other changes.
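To make the idea of "resemblance" concrete, here is a minimal sketch of one shingle-based measure: each document is reduced to its set of overlapping word sequences ("shingles"), and two documents are compared by the ratio of shared shingles to total distinct shingles. The function names, the shingle size of four words and the lower-casing step are illustrative assumptions, not details taken from the paper.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-word sequences in text.

    Assumption: words are split on whitespace and lower-cased.
    A document shorter than w words yields one shingle holding all its words.
    """
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(doc_a, doc_b, w=4):
    """Shared shingles divided by all distinct shingles, in the range 0.0 to 1.0."""
    set_a, set_b = shingles(doc_a, w), shingles(doc_b, w)
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    edited = "the quick brown fox jumps over the lazy cat"
    # One changed word alters only the last shingle, so the score stays high.
    print(round(resemblance(original, edited), 3))
```

A score near 1.0 suggests that two pages are versions of the same document, while a score near 0.0 suggests unrelated content. Comparing every pair of pages this way would be far too slow at Web scale, so a practical system would need a compact summary of each document rather than the full shingle sets.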
The researchers have also addressed "URL instability". This covers cases where a document still exists but cannot be obtained because it has been moved; where a URL refers to an old version of a document when a newer version exists; or where a URL is too slow and a user wants an identical or similar document that will be easier to retrieve.
The researchers anticipate:
• "Lost and found" services to help users locate Web pages that disappear over time.
• A page repair service to track and adjust hyperlinks as documents are moved over time.
• Clustering of search results (so similar versions of a document are not returned as separate entries).
• Easier global updating of important information, such as regulations.
• More accurate characterising of the way pages change over time, which should help in the design of proxies, directories, search engines and browsers.
• Easier detection of illegal copies or modifications of intellectual property.
• Applications of their algorithms to media types other than text.