The National Library is to digitise important parts of its newspaper collection following a successful pilot of optical character recognition (OCR) software.
As well as making the collection more conveniently searchable, OCR saves storage space and frequently improves legibility, says developer Gordon Paynter.
By the end of June, the library plans to have introduced a production version of the OCR acquisition software, together with a public interface and search capability, for a selection of newspapers. That selection will eventually cover 5% of the library’s complete holdings.
Choosing what to make available in this format is largely “a curatorial decision”, says Paynter. The library will first prioritise material of most value to researchers.
The image processing part of the project has been contracted to Planman Consulting, based in San Diego and New Delhi, in partnership with German consultancy CCS. They will use the Docworks software, which puts a workflow process around the recognition activity. The recognition software itself is the popular ABBYY FineReader.
The other half of the project, the online delivery system, will use the open-source Greenstone library software and will be built by Hamilton company DL Consulting.
The recognition process automatically “zones” pages into article text, illustrations and advertisements. The keyword search capability that will sit on top of it is based on Lucene, another open-source product, from the Apache Software Foundation. “That lets you do ‘fuzzy’ search,” Paynter says, finding an article even if the search term differs slightly from the keyword as recognised. Such differences can arise when the OCR engine misreads the printed text.
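As a rough illustration of the idea behind fuzzy search, the Python sketch below matches a search term against OCR output by edit distance, the number of single-character changes needed to turn one word into another. It is not Lucene's code or the library's implementation; the function names, the sample words and the one-edit threshold are all assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def fuzzy_find(term: str, words: list[str], max_edits: int = 1) -> list[str]:
    """Return the words within max_edits of term, ignoring case."""
    term = term.lower()
    return [w for w in words if edit_distance(term, w.lower()) <= max_edits]


# Hypothetical OCR output in which "Wellington" was misread as "Wellingtcn".
ocr_words = ["Wellingtcn", "harbour", "Auckland"]
print(fuzzy_find("Wellington", ocr_words))  # ['Wellingtcn']
```

A search for "Wellington" still finds the misread word because only one character differs. A larger threshold would catch more OCR errors at the cost of more false matches.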
A specialist dictionary has been built with Maori words and New Zealand place names, but there is bound to be a small proportion of misreading, Paynter says, and the scanned pages carry a disclaimer to that effect.
In line with open-source principles, all development will be contributed back to the Greenstone project.
The pilot project has already digitised and OCRed 100,000 pages. A further 75,000 will be added this year. Another 960,000 pages have been digitised but not yet OCRed.
These were posted online from 2001 onwards in the library’s “Papers Past” project, but as large-volume .tif image files with a very limited search capability.
The user interface will be developed to comply with the government’s web guidelines, Paynter says.