The National Library will be conducting its second “web harvest” of New Zealand websites, from May 12 to 25.
The harvest aims to capture a snapshot of publicly available content from as many URLs as possible in the .nz domain, along with .com, .net and .org sites that appear to be hosted on servers in New Zealand.
The harvest will also access “selected websites that are owned by New Zealanders or otherwise considered New Zealand publications by the National Library of New Zealand Act (2003),” a Library briefing note says.
The first web harvest was conducted in 2008.
The Library sees web harvesting as a logical extension of its role of collecting physical textual and pictorial material that reflects the country’s social history.
The exercise will be done on the Library’s behalf by the Internet Archive, a US-based not-for-profit organisation. It will not attempt to probe into password-protected areas of sites and will respect the robots.txt file that guards portions of a site against automated access.
“However, we will always take a copy of a website’s homepage, regardless of the robots.txt rule,” says the Library.
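The robots.txt convention the Library refers to is a plain-text file at a site's root that tells crawlers which paths they may fetch. As a minimal sketch of how a harvester might honour it, here is an example using Python's standard `urllib.robotparser`; the site name, rules, and crawler user-agent below are hypothetical, not taken from the Library's actual configuration:

```python
from urllib import robotparser

# Hypothetical robots.txt rules a site might serve.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant harvester checks each URL before fetching it.
crawler = "nz-web-harvest"  # hypothetical user-agent string
print(rp.can_fetch(crawler, "https://example.nz/private/report.html"))  # False: disallowed
print(rp.can_fetch(crawler, "https://example.nz/index.html"))           # True: allowed
```

In practice a crawler would download each site's live robots.txt (via `rp.set_url()` and `rp.read()`) rather than parsing a fixed string; the Library's stated exception, always copying the homepage, would simply bypass this check for the root URL.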
The National Library expects to capture data from 130 to 140 million URLs, totalling 7 to 8TB of uncompressed data.