National Library ahead of UK in web harvesting

Three whole-of-web-domain crawls have already taken place

New Zealand’s National Library is well ahead of UK efforts to harvest and preserve web content.

Six libraries in the UK, including the national institution, the British Library, will now start collecting, preserving and providing long-term access to the nation’s digital output, including blogs, e-books and the entire UK web domain.

The key step has been to get legal and regulatory apparatus in place to provide for “legal deposit”.

New Zealand, however, has had legislation in place since 2003 and appropriate regulations since 2006.

“With the support of the legislation and regulations, the [National] Library has been undertaking a three-pronged approach to web archiving: selective, where a particular web site is deemed to have long-lasting value; thematic, where a number of web sites are selected for harvesting as representative of a particular social or cultural event (eg an election); and whole-of-domain harvests, where a snapshot is taken of the whole .nz domain,” says Steve Knight, the National Library’s programme director, preservation research.

“The National Library of New Zealand also harvests New Zealand related sites in the .com, .net and .org domains where we are able to isolate those,” he says.

The legal deposit legislation and regulations are the same in the UK and New Zealand. There are differences of scale, Knight says, and the British Library has to consider territorial issues, for example how to distinguish between English, Scottish and Irish web sites within the .uk domain.

The British libraries are aiming to crawl the .uk domain annually. To date the National Library of New Zealand has undertaken three whole-of-domain crawls, in October 2008, April 2010 and March 2013.

“The [NZ National] Library is currently constrained to providing three simultaneous accesses to a Legal Deposit document and ‘must not make the document available on the internet’,” Knight says. “Online documents are therefore not available as fully as they could be in the internet age.

“There are some complexities in making the whole-of-domain material available, including indexing and delivery for such large datasets in a manner that would be meaningful to users,” he says. “As part of the 2013 crawl we will receive a discrete set of data from .govt.nz. This we hope to be able to make available as a discrete data set.

“We are still undertaking work to determine the Library's position on the delivery of certain parts of the .nz domain, eg hate and porn material.

“The Library is also interested in preserving the wider participation of New Zealanders in the internet. We have been able to source a copy of the Geocities [forum], founded in 1994, and we hope to find a lot of interesting information related to New Zealand and New Zealanders in this early internet environment.”
