Making sense of the web’s new ‘infinite library’

The search engines that guard entry to the treasure that is the worldwide web are modern-day dragons, according to a new book on the subject, writes Stephen Bell

Web Dragons: Inside the Myths of Search Engine Technology is a praiseworthy attempt to make sense of the meteoric progress of the worldwide web and its impact on our lives. The authors, who include Waikato University computer science professor Ian Witten, begin with philosophy and magic: with Plato, Austrian philosopher Ludwig Wittgenstein and Argentinian “magic-realist” author Jorge Luis Borges.

Added to this are allusions to the dragons that give the book its title: the mythological beasts that are often gatekeepers to treasure hoards — the position search engines occupy in relation to the treasure of knowledge (and pit of rubbish) that is the web.

“The impious maintain that nonsense is normal in the Library and the reasonable is a miraculous exception,” says Borges, in his fable about the infinite library which contains every book that could possibly be written. Surely, says Web Dragons, this is a passage that speaks powerfully of the state of the web.

Plato’s lesson is that the search for truth is almost impossible because one must first know what one is looking for. Wittgenstein’s observation is that large chunks of reality are determined, through language, by community consensus. The web, say the authors, “externalises knowledge” through consensus use of language in exactly the same way.

Such evocative scene-setting will fascinate some but bore those anxious to get on with the nitty-gritty of search-engine algorithms. But before the reader comes to these, he or she is taken on a canter through the history of library digitisation (Project Gutenberg and the US Library of Congress Million Books project), as well as some early cybernetics and computer science history. A summary of the basic concepts of the web (HTTP, HTML, XML) is also included and ensures no one is left floundering in acronyms. A potted history of literary concordances gives an idea of the size of the problem involved in indexing the web.

However, a search engine is more than an index, says Web Dragons. Most, like Google, attempt some ranking based on the reputation of the various information sources. They also help users locate the hubs that have the most onward links to material on a particular subject. The blogosphere extends this further and adds an element of timeliness.
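For readers who want to see the idea in miniature, the following Python sketch shows one common way of turning links into a reputation score, a simplified PageRank-style calculation, alongside a naive count of onward links as a stand-in for "hubs". It is an illustration only: the function names, the four-page link map and the damping value of 0.85 are assumptions made for this example, and nothing here is taken from the book or from any real search engine.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages by repeatedly passing reputation along links.

    links maps each page to the list of pages it links to (hypothetical data).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal reputation
    for _ in range(iterations):
        # every page keeps a small baseline score regardless of inbound links
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # a page shares its reputation equally among the pages it links to
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # a page with no onward links spreads its score over all pages
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def hub_scores(links):
    """Naive hub measure: the number of distinct onward links per page."""
    return {page: len(set(targets)) for page, targets in links.items()}


# Hypothetical four-page web used only for this illustration.
EXAMPLE_LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

if __name__ == "__main__":
    ranks = rank_pages(EXAMPLE_LINKS)
    hubs = hub_scores(EXAMPLE_LINKS)
    print("most reputable page:", max(ranks, key=ranks.get))
    print("biggest hub:", max(hubs, key=hubs.get))
```

In this toy web, page "c" is linked to by three of the other pages and so collects the most reputation, while page "a" has the most onward links and so counts as the biggest hub.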

The book chronicles the first forays into the semantic web, and search tools’ attempts to work with meaning rather than merely match words.

Documents are enhanced using metadata, fitting their material to logical categories and, hence, supporting what is called “automated reasoning”, rather than simple-minded keyword searches. In a typically down-to-earth example, the authors show how this should enable the category of “sweet pastries” to be located and then narrowed down to those filled with fruit. This allows one to find a recipe for apple strudel without having to remember what it is called, a partial answer to Plato’s problem.
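The strudel example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' method: the category tree, the recipe records and the function names below are all invented for the purpose. The point is that the search never mentions the word "strudel"; it reasons from the category metadata instead.

```python
# Hypothetical category tree: each category points to its broader parent.
PARENT = {
    "strudel": "sweet pastries",
    "eclair": "sweet pastries",
    "sweet pastries": "pastries",
    "savoury pie": "savoury pastries",
    "savoury pastries": "pastries",
}

# Hypothetical recipe records carrying metadata rather than just text.
RECIPES = [
    {"title": "Apple strudel", "category": "strudel", "filling": "fruit"},
    {"title": "Chocolate eclair", "category": "eclair", "filling": "cream"},
    {"title": "Steak pie", "category": "savoury pie", "filling": "meat"},
]


def is_a(category, ancestor):
    """Walk up the category tree to see whether category falls under ancestor."""
    while category is not None:
        if category == ancestor:
            return True
        category = PARENT.get(category)
    return False


def search(recipes, ancestor, **attributes):
    """Return titles in the given category whose metadata matches the attributes."""
    return [
        recipe["title"]
        for recipe in recipes
        if is_a(recipe["category"], ancestor)
        and all(recipe.get(key) == value for key, value in attributes.items())
    ]


if __name__ == "__main__":
    print(search(RECIPES, "sweet pastries"))
    print(search(RECIPES, "sweet pastries", filling="fruit"))
```

The first search returns both sweet pastries; adding the fruit filter narrows the result to the apple strudel alone.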

The “Web Wars” chapter tackles how the delicate balance between tolerating the crawlers that feed search-engines and blocking them is maintained, especially when they threaten to overwhelm resources, either unintentionally or deliberately (as in the case of botnets and spam). The authors discuss how techniques for improving the visibility of websites and promoting products for sale can be useful but can also shade into the unethical.

Web Dragons then moves on to discuss who controls information, pointing out that search engines and those who artificially inflate the rank of their sites can effectively bury large and potentially useful portions of the web.

From here, the book moves on to a competent but necessarily open-ended discussion of the weighty topics of privacy, censorship and copyright. It concludes with a stab at possible future developments. These include curated digital libraries; communities of interest maintaining their own metadata; individuals trusting some of their personal data-holdings to the web, and the influence of rising peer-to-peer networks. All are covered briefly, but in a way that inspires further thought.

Optional exercises are given at the end of each chapter, to help drive home points of discussion. This makes Web Dragons a book one would take at least a week or two to read and use fully. Fortunately, it’s also amenable to selective dipping. The biggest concern is that some of the web links liberally sprinkled through the text will change or die, reducing the book’s usability over time.

Web Dragons: Inside the Myths of Search Engine Technology, by Ian H Witten, Marco Gori and Teresa Numerico. Published by Morgan Kaufmann. Available through US$19.77.

Italian dreams

The joint effort that became Web Dragons was seeded during a visit by Waikato University’s computer science professor, Ian Witten, to Italy.

This was followed by a more extended visit, sponsored by the University of Siena, where co-author Marco Gori is professor of computer science.

The book’s third author, Teresa Numerico, teaches network theory and communications studies at the University of Salerno, Italy, and is also a philosophy of science researcher.

