Digitising newspapers on a large scale involves many considerations. Which copies do you select from millions of non-digitised newspapers? What about the paper’s acidification, which makes newspapers so fragile? And how do you make everything searchable? Meemoo and Flanders Heritage Library are digitising 630,000 newspapers as part of the GIVE project. You can read all about the entire process, from A to Z, here.
Choosing a relatively manageable set from millions of newspapers that have not yet been digitised was the first step in this mass digitisation project. The Flanders Heritage Library non-profit organisation used several criteria to solve this puzzle for the GIVE newspaper project, ‘Primeur’, with a special focus on the most vulnerable (i.e. acidified) copies because they will deteriorate first. You can read more about these criteria here.
With our selection in hand, it was time for the collection managers to register and package the materials. We thoroughly inspected the data for each newspaper bundle before recording it in an online registration system. After any necessary repackaging, we used a unique barcode to link each copy to this digital record. A whole bunch of registrars dedicated over 2,000 hours to this phase, handling 9,600 newspapers per week on average!
What information was recorded? Both content-related and technical data about the newspaper editions found their way into the online system. And that’s handy because technical metadata, such as a newspaper’s dimensions, streamline the digitisation process. Content-related metadata, such as the title, language used and year of publication, enhance the content’s searchability.
We also paid attention to recording any damage to the vulnerable newspapers. The paper that newspapers were printed on in the past was not meant to last forever. It is naturally self-destructive and particularly susceptible to acidification, an unavoidable and irreversible process that eventually makes many newspapers so brittle that they simply fall apart – even in good storage conditions! Everyday handling can also damage the paper. The registrars frequently spotted tears, ink stains, creases and folds. And then there are the inevitable paper-eating silverfish and mould. Archives take the necessary measures against these natural enemies, but it is very difficult to prevent at least some damage.
Registering these damage characteristics is crucial to ensuring a smooth digitisation process and gaining a good overview of the material condition of the newspaper heritage across organisations. It is also important for assessing the quality of the later optical character recognition (OCR) – the conversion of digital images into machine-readable and searchable text.
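To make the registration step more concrete, here is a minimal sketch of what one record could look like. It only illustrates the kinds of data mentioned above (barcode, title, language, year, dimensions, damage); the class name, field names and sample values are assumptions, not the schema of the actual registration system.

```python
from dataclasses import dataclass, field


@dataclass
class NewspaperRecord:
    """Hypothetical registration record for one newspaper copy or bundle."""
    barcode: str          # links the physical copy to its digital record
    title: str            # content-related metadata
    language: str
    year: int
    width_mm: int         # technical metadata used to plan the capture
    height_mm: int
    damage: list[str] = field(default_factory=list)  # e.g. "tear", "mould"

    def has_damage(self) -> bool:
        """True when at least one damage characteristic was registered."""
        return bool(self.damage)


# Invented example values, for illustration only.
record = NewspaperRecord("0000123", "Example Courant", "nl", 1912, 400, 580, ["tear"])
print(record.has_damage())
```

Keeping the damage characteristics as a list on the same record makes it easy to filter fragile copies before they go to the digitisation company.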
Every newspaper that is transported to the digitisation company must be securely packaged to avoid any damage to this vulnerable medium. These fragile carriers are therefore packaged in three layers that together form an additional protective shield. In the registration phase, any loose editions or bundles without covers are placed in a new folder made from acid-free paper, tailored to the respective dimensions. They are then placed together in a sturdy box, which in turn is placed in a plastic container. Ready for shipping!
Picturae, a specialist digitisation company based in the Netherlands, is overseeing the digitisation process in the GIVE newspaper project, Primeur. As the initiator and coordinator of the GIVE project, meemoo chose this company following a thorough selection and comparison process.
The 630,000 newspapers from eight preservation institutions were divided into smaller quantities to ensure that the storage space at Picturae did not become overcrowded and everything remained well-organised and manageable.
The goal of the GIVE newspaper project is to create faithful digital copies of thousands of newspaper pages, and to do so quickly. Picturae used two different digitisation set-ups for this, depending on the newspaper’s format:
Did you know? Picturae built this set-up specially for the GIVE project.
a set-up with one camera
One correctly calibrated camera is positioned above this set-up, perpendicular to both pages of an opened newspaper. The camera then captures both pages simultaneously in a ‘bifolio’ shot. We only used this method for smaller newspapers in the GIVE project, to make sure that the quality requirements were met and that the number of pixels per page stayed high enough.
The advantage of this working method is that dividing the newspapers into two sets saves Picturae a lot of time and allows them to choose the most suitable digitisation method depending on each newspaper’s format. And that’s vitally important in a mass digitisation project such as this! We were able to make 1,400 captures a day on average thanks to this method.
The aim of the GIVE project is to create high-quality digital copies, preserve them for the future, and make them reusable for various applications. The step-by-step process for digitising newspapers is as follows:
Correctly set up the camera or cameras;
Carefully remove any dust or dirt as needed;
Place the newspaper on a table with moving surfaces. These surfaces help to maintain balance, especially for large bundles, reducing pressure on the spine of the newspaper;
Use a laser pointer to determine the centre alignment of the newspaper;
Lower a glass plate to neatly flatten the newspaper;
With a simple press of a button, the capture is done!
The operator wears a special glove to turn the pages;
Once all pages are captured, the newspaper is packed up again and safely stored back in the box;
Picturae reviews the metadata and makes additions, such as the type of camera used for the capture or the software.
The digitisation process follows the strict Metamorfoze guidelines, even though this set of standards for digital photography presented some challenges. What did it mean for this project?
Every day, meemoo and Picturae set and extensively test all technical requirements, including the correct aperture, resolution per dimension, white balance and tonal scale. This is done using targets: cards and scales used to check the camera settings. Metadata about the digitisation process is also added, and the physical object is linked to the digital capture. Depending on the size of the physical newspaper, the resolution is set at 300 PPI. The captures are also automatically cropped. After each capture, the operator checks that the image is straight, sharp and not excessively cropped.
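The relationship between page size, resolution and pixel count is simple arithmetic, and it explains why larger formats put more pressure on a single-camera capture. The sketch below computes the pixel dimensions a page needs at a given PPI. The function name and the page size in the example are assumptions for illustration, not project figures.

```python
MM_PER_INCH = 25.4


def pixels_for(width_mm: float, height_mm: float, ppi: int = 300) -> tuple[int, int]:
    """Return the (width, height) in pixels needed to capture a page at the given PPI."""
    return (
        round(width_mm / MM_PER_INCH * ppi),
        round(height_mm / MM_PER_INCH * ppi),
    )


# Hypothetical page of 400 mm x 580 mm captured at 300 PPI.
print(pixels_for(400, 580))  # (4724, 6850)
```

A bifolio shot has to cover twice the page width with the same sensor, which is why that method suits smaller newspapers best.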
What is a target? This crucial measuring instrument for colour, lighting, white balance, resolution and more consists of standardised colour cards and tonal scales. By capturing these targets digitally and comparing the measured values with the reference values, we can spot any deviations and make sure that everything is done correctly. They are the ideal allies for consistency throughout a digitisation project. The targets Delt.ae and GIMP were used in the GIVE newspaper project.
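As an illustration of what "comparing with the reference values" can mean in practice, the sketch below measures the colour difference between captured and reference patches in CIELAB using the simple CIE76 formula (the Euclidean distance between two L*a*b* values). The patch names, the values and the tolerance are invented for the example. The actual tolerances come from the Metamorfoze guidelines and are not reproduced here, and the tools used in the project may apply a different colour-difference formula.

```python
import math


def delta_e76(lab_a, lab_b):
    """CIE76 colour difference: Euclidean distance between two L*a*b* triples."""
    return math.dist(lab_a, lab_b)


def check_target(measured, reference, max_delta_e=5.0):
    """Return the patches whose deviation exceeds the (hypothetical) tolerance."""
    failures = {}
    for patch, ref_lab in reference.items():
        deviation = delta_e76(measured[patch], ref_lab)
        if deviation > max_delta_e:
            failures[patch] = round(deviation, 2)
    return failures


# Invented reference and measured values for three grey patches.
reference = {"white": (95.0, 0.0, 0.0), "grey": (50.0, 0.0, 0.0), "black": (20.0, 0.0, 0.0)}
measured = {"white": (94.0, 0.5, -0.5), "grey": (50.5, 0.2, 0.1), "black": (27.0, 1.0, 2.0)}
print(check_target(measured, reference))  # {'black': 7.35}
```

An empty result means every patch stayed within tolerance. Any patch that is listed signals that the camera settings or lighting need attention before production captures continue.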
Newspapers are delicate carriers, which presents challenges. We therefore paid close attention to safe registration and transportation. During the digitisation process, it was important not to put too much pressure on the spine of the bundles. Picturae solved this issue by using a special table with different moving surfaces. Turning the pages was also done with great care. Acidic paper can crumble under pressure, so we distributed the pressure as evenly as possible, and held and turned the newspaper pages with both hands to reduce the risk of tearing.
High-quality digital images of thousands of newspaper pages: check! But then what? The next step in the GIVE project is to apply a technical feature called optical character recognition (OCR). This AI technique makes text computer-readable. It is particularly useful for a medium such as newspapers, as it greatly improves searchability and accessibility.
A dive into the process:
The true-to-life digital image of the newspaper is first enhanced as much as possible. Contrast and brightness are adjusted, and any noise is removed, to make each letter as readable as possible.
A dictionary is consulted in step three. Are the recognised characters actual words? This step also affects the probability score.
Next, a final spell check is performed. Many newspapers in Primeur are quite old, so historical dictionaries are consulted.
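Two of the steps above lend themselves to a small sketch: stretching the contrast of the image before recognition, and letting a dictionary lookup nudge the probability score of a recognised word. This is a simplified illustration, not the OCR engine used in the project. The function names, the bonus and penalty values and the tiny word list are all assumptions.

```python
def stretch_contrast(pixels):
    """Linearly rescale grey values so the darkest becomes 0 and the lightest 255."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)  # a flat image has no contrast to stretch
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


def adjust_score(word, score, dictionary, bonus=0.1, penalty=0.2):
    """Raise the score for dictionary words, lower it otherwise, clamped to [0, 1]."""
    if word.lower() in dictionary:
        return min(1.0, score + bonus)
    return max(0.0, score - penalty)


# Illustrative word list; a real one would include historical spellings.
dictionary = {"krant", "courant"}

print(stretch_contrast([50, 75, 150]))        # [0, 64, 255]
print(adjust_score("Courant", 0.95, dictionary))  # 1.0
print(adjust_score("kxant", 0.1, dictionary))     # 0.0
```

Using a word list that matches the period of the newspaper matters here: a modern dictionary would penalise older spellings that were recognised correctly.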
Before the digital files flow into meemoo’s archive system, checks are run to make sure all files meet the required quality standards. In addition to daily checks of the targets with Open Dice, meemoo randomly samples the supplied files to ensure completeness, and performs a visual check. Are the images sharp? Has too much been cropped? Are there any fingers in the frame? Then the files are checked – in TIFF format – for content and structure using DPF Manager. The correct TIFF profile is crucial for their long-term preservation, and Baseline 6.0 uncompressed was chosen in the GIVE newspaper project.
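To give an idea of what a structural TIFF check looks at, the sketch below reads the Compression tag (tag 259 in the TIFF 6.0 specification) from the first image file directory and reports its value, where 1 means uncompressed. It is only a minimal illustration of one property. It is not how DPF Manager works internally and does not replace a full conformance check. The function name is an assumption.

```python
import struct

COMPRESSION_TAG = 259  # TIFF 6.0 tag number for Compression
UNCOMPRESSED = 1       # value meaning "no compression"


def tiff_compression(data: bytes) -> int:
    """Return the Compression value from the first IFD of a TIFF byte string."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF: unknown byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF: bad magic number")
    (entry_count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes long
        tag, _type, _count = struct.unpack(order + "HHI", data[start:start + 8])
        if tag == COMPRESSION_TAG:
            (value,) = struct.unpack(order + "H", data[start + 8:start + 10])
            return value
    return UNCOMPRESSED  # the tag is optional; TIFF 6.0 defaults to 1


# Minimal in-memory example: header plus one IFD holding only the Compression tag.
sample = (
    struct.pack("<2sHI", b"II", 42, 8)        # little-endian header, IFD at byte 8
    + struct.pack("<H", 1)                    # one IFD entry
    + struct.pack("<HHIHH", 259, 3, 1, 1, 0)  # Compression, SHORT, count 1, value 1
    + struct.pack("<I", 0)                    # no further IFDs
)
print(tiff_compression(sample) == UNCOMPRESSED)  # True
```

For a real file, the same function can be called on the bytes read from disk. A value other than 1 would mean the file does not match the uncompressed profile chosen for the project.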
The OCR results (according to the ALTO standard) are validated using an XSD schema. Is the file correctly formatted? Is everything included in the file? How readable is it? A disclaimer is also added to this new content data, as errors can sometimes occur.
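The "how readable is it?" question can be approximated from the ALTO file itself, because ALTO lets each recognised word (a `String` element) carry a word confidence in its `WC` attribute, a value between 0 and 1. The sketch below averages those values using only Python's standard library. It checks well-formedness as a side effect, but it does not perform XSD validation, which needs a schema-aware XML library. The function name and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET


def mean_word_confidence(alto_xml: str):
    """Average the WC attribute over all String elements, or None if there are none."""
    root = ET.fromstring(alto_xml)  # raises ParseError if the XML is not well-formed
    scores = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # ignore the ALTO namespace version
        if local_name == "String" and "WC" in element.attrib:
            scores.append(float(element.attrib["WC"]))
    return sum(scores) / len(scores) if scores else None


# Tiny invented ALTO fragment; the namespace version is an assumption.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Gazet" WC="1.0"/><SP/><String CONTENT="van" WC="0.5"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""
print(mean_word_confidence(sample))  # 0.75
```

A low average does not say which words are wrong, but it is a cheap signal for flagging pages whose OCR deserves a closer look.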
The metadata created by Picturae (in METS XML) is finally checked automatically: are the digital packages, and their required specifications, complete? After this, the files can flow into meemoo as a single SIP (submission information package), where they are sustainably archived. The newspapers can then gradually be made accessible by meemoo and the partners in this project – so that everyone can eventually enjoy this wealth of information.
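One part of such an automatic completeness check can be sketched as follows: every file that the METS document references through an `FLocat` element should actually be present in the package. The example uses the standard METS and XLink namespaces; the function name, the file paths and the package layout are assumptions and do not describe meemoo's actual SIP specification.

```python
import xml.etree.ElementTree as ET

FLOCAT = "{http://www.loc.gov/METS/}FLocat"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def missing_files(mets_xml: str, files_in_package: set[str]) -> list[str]:
    """Return the referenced file paths that are not present in the package."""
    root = ET.fromstring(mets_xml)
    referenced = [element.get(XLINK_HREF) for element in root.iter(FLOCAT)]
    return sorted(href for href in referenced if href and href not in files_in_package)


# Invented METS fragment referencing one image and one OCR file.
sample = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec><mets:fileGrp>
    <mets:file ID="f1"><mets:FLocat LOCTYPE="URL" xlink:href="images/0001.tif"/></mets:file>
    <mets:file ID="f2"><mets:FLocat LOCTYPE="URL" xlink:href="ocr/0001.xml"/></mets:file>
  </mets:fileGrp></mets:fileSec>
</mets:mets>"""
print(missing_files(sample, {"images/0001.tif"}))  # ['ocr/0001.xml']
```

An empty list means the package contains everything the metadata promises, which is the precondition for handing it over as one unit.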
This project was made possible with support from the European Regional Development Fund and is part of the Flemish Government’s Resilience Recovery Plan.