Early days in web harvesting

Industry reaction to government plans to collect electronic data for the National Library hinges on how detailed its requirements turn out to be.

The National Library Bill, currently going through the select committee process and due for enactment this year, extends the current “legal deposit” scheme, which applies to book publishers, to include websites.

At present, book publishers must give the National Library up to three copies of a published book to be preserved for posterity. The new bill will allow the library to demand data from electronic publishers, threatening them with fines of up to $5000 if they do not comply.

National Library digital records coordinator Steve Knight says the library has already begun collecting data from a dozen sites for research purposes, but has yet to decide on a method for much wider “harvesting” beginning later this year.

Knight says the library has studied a targeted approach in Australia and a Swedish method where data is stored first for sorting later. He stresses not everything will be collected from everybody, but rather “snapshots of what would become New Zealand history”.

For major online publishers, this could include making an arrangement for regular data collections, perhaps quarterly. Some personal or family websites, maybe even CVs, will also be collected, as these too would be of interest to future historians.

The National Library presently has three terabytes of storage for its existing digitisation programme. But, Knight says, the library has to look at issues concerning how to manage complex metadata, and what tools to use for scouring the .nz name space and other New Zealand-related websites, a process likely to take six months.

“We are at the study and research phase at the moment. The costs over time are very unclear, as there are no models about the sustainability of web harvesting. We are doing this out of baseline funding,” he says.

Knight expects a zero or “minimal” cost to publishers, regardless of whether they have to send data to the library or the library “pulls” it from them.

“If we get the harvesting right, there should be no overheads to the publishing community. The onus is on us to say what will be coming in. We will respect intellectual property rights. We do ourselves a disservice if we did not. The $5000 fine has only been used once as a last resort,” he says.

Peter Fowler, manager of Moonbase Media, which publishes the Newsroom website, says there is no problem with the plan as long as compliance is easy and it does not affect his firm’s paid-for services.

Guy Macgibbon, news editor of the Scoop online news service, says he has no problem with the library collecting Scoop content, so long as there aren’t cumbersome compliance requirements.

“It’s unworkable to provide a copy of everything we publish. If the library is interested in a copy, [it can] take down the URL and make a copy of the website. We do have a free public access website, which we have every intention of maintaining. But we are not in the business of providing content to the National Library,” Macgibbon says.

TVNZ spokesman Glen Sowry says the SOE also has no issue with providing data from its Nzoom.com website so long as the burden of providing it is not great.
