Digitising 630,000 newspaper pages – a first!

Digitising newspapers: from A to Z

Digitising newspapers on a large scale involves many considerations. Which copies do you select from millions of non-digitised newspapers? What about the paper’s acidification, which makes newspapers so fragile? And how do you make everything searchable? Meemoo and Flanders Heritage Library are digitising 630,000 newspaper pages as part of the GIVE project. You can read all about the entire process, from A to Z, here.

Thorough selection and preparation

Choosing a relatively manageable set from millions of newspapers that have not yet been digitised was the first step in this mass digitisation project. The Flanders Heritage Library non-profit organisation used several criteria to solve this puzzle for the GIVE newspaper project, ‘Primeur’, with a special focus on the most vulnerable (i.e. acidified) copies because they will deteriorate first. You can read more about these criteria here.

With our selection in hand, it was time for the collection managers to register and package the materials. We thoroughly inspected the data for each newspaper bundle before recording it in an online registration system. After any necessary repackaging, we used a unique barcode to link each copy to this digital record. A team of registrars dedicated over 2,000 hours to this phase, handling 9,600 newspapers per week on average!

  • Registration at Ghent University, photo by meemoo, licence: CC BY-SA
  • Newspapers at Picturae, photo by meemoo, licence: CC BY-SA

Meticulous registration

What information was recorded? Both content-related and technical data about the newspaper editions found their way into the online system. That is useful because technical metadata, such as a newspaper’s dimensions, streamlines the digitisation process, while content-related metadata, such as the title, language and year of publication, makes the content easier to search.
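
As a rough illustration, a registration record that combines both kinds of metadata might look like the sketch below. The field names, the barcode format and the sample values are hypothetical; the actual schema of the online registration system is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class NewspaperRecord:
    """Hypothetical registration record for one newspaper bundle."""
    barcode: str   # unique barcode linking the physical copy to this record
    # Content-related metadata: helps searchability
    title: str
    language: str
    year: int
    # Technical metadata: helps plan the digitisation
    height_mm: int
    width_mm: int
    # Observed damage, e.g. "tears", "mould" (free-text labels, assumed)
    damage: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """Check that the mandatory fields are filled in."""
        has_text = bool(self.barcode and self.title and self.language)
        has_numbers = self.year > 0 and self.height_mm > 0 and self.width_mm > 0
        return has_text and has_numbers


# Sample values are made up for illustration only
record = NewspaperRecord(
    barcode="EXAMPLE-0001", title="Vooruit", language="nl", year=1914,
    height_mm=600, width_mm=430, damage=["tears"],
)
print(record.is_complete())
```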

Acidic newspaper at Picturae, photo by meemoo, licence: CC BY-SA

We also paid attention to recording any damage to the vulnerable newspapers. The paper that newspapers were printed on in the past was not meant to last forever.

It is naturally self-destructive and particularly susceptible to acidification, an unavoidable and irreversible process that eventually makes many newspapers so brittle that they simply fall apart – even in good storage conditions! Everyday handling can also damage the paper. The registrars frequently spotted tears, ink stains, creases and folds. And then there are the inevitable paper-eating silverfish and mould. Archives take necessary measures against these natural enemies, but it is very difficult to prevent at least some damage.

Registering these damage characteristics is crucial to ensuring a smooth digitisation process and gaining a good overview of the material condition of the newspaper heritage across organisations. It is also important for assessing the quality of the later optical character recognition (OCR) – the conversion of digital images into machine-readable and searchable text.

Packaging and transportation

Every newspaper that is transported to the digitisation company must be securely packaged to avoid any damage to this vulnerable medium. These fragile carriers are therefore packaged in three layers that together form an additional protective shield. In the registration phase, any loose editions or bundles without covers are placed in a new folder made from acid-free paper, tailored to the respective dimensions. They are then placed together in a sturdy box, which in turn is placed in a plastic container. Ready for shipping!

Mass digitisation?

Picturae, a specialist digitisation company based in the Netherlands, is overseeing the digitisation process in the GIVE newspaper project, Primeur. As the initiator and coordinator of the GIVE project, meemoo chose this company following a thorough selection and comparison process.

The 630,000 newspaper pages from eight preservation institutions were divided into smaller batches to ensure that the storage space at Picturae did not become overcrowded and everything remained well-organised and manageable.

Digitising with two sets

The goal of the GIVE newspaper project is to create faithful digital copies of thousands of newspaper pages, and to do so quickly. Picturae used two different digitisation set-ups for this, depending on the newspaper’s format:

  • a set-up with two cameras
    Two accurately calibrated cameras are positioned above this set-up, each aimed at one page of the newspaper. The cameras then take a picture of one page each, which is called a ‘unifolio’ shot and has the advantage that even very large copies are captured in good quality.

Did you know? Picturae built this set-up specially for the GIVE project.

  • a set-up with one camera
    One correctly calibrated camera is positioned above this set-up, perpendicular to both pages of an opened newspaper. The camera then captures both pages simultaneously in a ‘bifolio’ shot. We only used this method for smaller newspapers in the GIVE project – to ensure quality requirements are maintained and keep the number of pixels per page high enough.

Dividing the newspapers into two sets saves Picturae a lot of time and lets them choose the most suitable digitisation method for each newspaper’s format. And that’s vitally important in a mass digitisation project such as this! We were able to make 1,400 captures a day on average thanks to this method.
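
To see why format matters, here is a minimal sketch of the arithmetic behind the choice. It assumes that the deciding factor is whether one camera can still reach the 300 PPI resolution mentioned in the technical details when it has to cover both pages at once. The sensor width of 14,000 pixels is a made-up figure, and the sketch ignores margins and the sensor's short edge; Picturae's actual decision rule is not documented here.

```python
MM_PER_INCH = 25.4


def effective_ppi(sensor_px: int, field_mm: float) -> float:
    """Pixels per inch when sensor_px pixels span field_mm of paper."""
    return sensor_px / (field_mm / MM_PER_INCH)


def choose_setup(page_width_mm: float, sensor_px: int = 14_000,
                 min_ppi: float = 300.0) -> str:
    """Pick 'bifolio' if one camera covers both pages at min_ppi or better.

    sensor_px is a hypothetical sensor width; min_ppi is the target resolution.
    """
    spread_mm = 2 * page_width_mm  # an opened newspaper shows two pages
    if effective_ppi(sensor_px, spread_mm) >= min_ppi:
        return "bifolio"
    return "unifolio"


print(choose_setup(300))  # small format: one camera is enough
print(choose_setup(700))  # large format: one camera per page
```

With these assumed numbers, a 300 mm wide page comes out as bifolio and a 700 mm wide page as unifolio.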

Deep dive: how do you digitise a newspaper?

The aim of the GIVE project is to create high-quality digital copies, preserve them for the future, and make them reusable for various applications. The step-by-step process for digitising newspapers is as follows:

  • Correctly set up the camera or cameras;

  • Carefully remove any dust or dirt as needed;

  • Place the newspaper on a table with moving surfaces. These surfaces help to maintain balance, especially for large bundles, reducing pressure on the spine of the newspaper;

  • Use a laser pointer to determine the centre alignment of the newspaper;

  • Lower a glass plate to neatly flatten the newspaper;

  • Press a button to make the capture;

  • Turn the page while wearing a special glove, then repeat the capture for each page;

  • Once all pages are captured, the newspaper is packed up again and safely stored back in the box;

  • Picturae reviews the metadata and makes additions, such as the type of camera used for the capture or the software.

Technical details

The digitisation process follows the strict Metamorfoze guidelines, even though this set of standards for digital photography presented some challenges. What did it mean for this project?

Every day, meemoo and Picturae set and extensively test all technical requirements, including the correct aperture, resolution per dimension, white balance and tonal scale. This is done using targets: cards and scales used to check the camera settings. Metadata about the digitisation process is also added, and the physical object is linked to the digital capture. Depending on the size of the physical newspaper, the resolution is set at 300 PPI. The captures are also automatically cropped. After each capture, the operator checks that the image is straight, sharp and not excessively cropped.

What is a target? This crucial measuring instrument for colour, lighting, white balance, resolution and more consists of standardised colour cards and tonal scales. We ensure that everything is done correctly by digitally capturing these targets and comparing any deviations with the reference values. They are the ideal allies for consistency throughout a digitisation project. The targets Delt.ae and GIMP were used in the GIVE newspaper project.
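
One common way to express such a deviation is a colour difference in CIELAB space. The sketch below uses the simple CIE76 formula (the Euclidean distance between two L*a*b* values). Whether the project's tooling uses this exact formula is an assumption, and the patch names, reference values and the tolerance of 5.0 are placeholders rather than Metamorfoze figures.

```python
import math


def delta_e_cie76(lab_measured, lab_reference) -> float:
    """CIE76 colour difference: Euclidean distance in L*a*b* space."""
    return math.dist(lab_measured, lab_reference)


def check_target(measured: dict, reference: dict, tolerance: float = 5.0) -> dict:
    """Return the patches whose deviation exceeds the (placeholder) tolerance."""
    failures = {}
    for patch, ref_lab in reference.items():
        deviation = delta_e_cie76(measured[patch], ref_lab)
        if deviation > tolerance:
            failures[patch] = round(deviation, 2)
    return failures


# Hypothetical reference and measured values for two patches of a target
reference = {"white": (95.0, 0.0, 0.0), "grey": (50.0, 0.0, 0.0)}
measured = {"white": (94.0, 1.0, -1.0), "grey": (58.0, 0.0, 6.0)}

print(check_target(measured, reference))
```

Here the white patch stays within tolerance, while the grey patch is flagged and the camera settings would need another look.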

Challenges posed by fragile paper

Newspapers are delicate carriers, which presents challenges. We therefore paid close attention to safe registration and transportation. During the digitisation process, it was important not to put too much pressure on the spine of the bundles. Picturae solved this issue by using a special table with different moving surfaces. Turning the pages was also done with great care. Acidic paper can crumble under pressure, so we distributed the pressure as evenly as possible, and held and turned the newspaper pages with both hands to reduce the risk of tearing.

  • BHL 20220311 07
    Newspaper of Bibliotheek Hasselt Limburg, photo by meemoo, licence: CC BY-SA
  • DSCN2302
    Picturae, photo by meemoo, licence: CC BY-SA

What happens after digitisation?

Focus on usability: OCR

High-quality digital images of thousands of newspaper pages: check! But then what? The next step in the GIVE project is to apply a technical feature called optical character recognition (OCR). This AI technique makes text computer-readable. It is particularly useful for a medium such as newspapers, as it greatly improves searchability and accessibility.

OCR applied to the Vooruit socialist newspaper, 25/09/1914, via nieuwsvandegrooteoorlog.hetarchief.be/en

A dive into the process:

  • The true-to-life digital image of the newspaper is first enhanced as much as possible. Contrast and brightness are adjusted, and any noise is removed, to make each letter as readable as possible.

  • Now that each letter is made as readable as possible, it’s time to run the text recognition on the digital images.
    • This AI technology first looks for blocks of text and then recognises the characters within those blocks.
    • The likelihood of correct character recognition is also calculated, assigning a probability score to each letter.
  • A dictionary is consulted in step three. Are the recognised characters actual words? This step also affects the probability score.

  • Next, a final spell check is performed. Many newspapers in Primeur are quite old, so historical dictionaries are consulted.

  • The OCR data is generated. For each letter, you can see which letter is recognised, the font and size it appears in, and the certainty of the recognition.
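
The interplay between character scores and the dictionary check can be sketched as follows. This is a toy illustration, not the OCR engine used in the project: the averaging, the bonus and penalty values and the tiny dictionary are all assumptions.

```python
from statistics import mean


def score_word(chars, dictionary, bonus=0.1, penalty=0.2):
    """Combine per-character probabilities into one word score.

    chars is a list of (character, probability) pairs from recognition.
    A dictionary hit raises the score and a miss lowers it (assumed values).
    """
    word = "".join(character for character, _ in chars)
    base = mean(probability for _, probability in chars)
    if word.lower() in dictionary:
        adjusted = min(1.0, base + bonus)
    else:
        adjusted = max(0.0, base - penalty)
    return word, round(adjusted, 3)


dictionary = {"krant"}  # stand-in for a (historical) word list

good = [("k", 0.9), ("r", 0.8), ("a", 0.9), ("n", 0.8), ("t", 0.9)]
doubtful = [("k", 0.9), ("r", 0.8), ("a", 0.9), ("u", 0.4), ("t", 0.9)]

print(score_word(good, dictionary))
print(score_word(doubtful, dictionary))
```

The misread word "kraut" is not in the word list, so its score drops and it stands out as a candidate for correction.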

Time for quality control

Before the digital files flow into meemoo’s archive system, checks are run to make sure all files meet the required quality standards. In addition to daily checks of the targets with OpenDICE, meemoo randomly samples the supplied files to ensure completeness, and performs a visual check. Are the images sharp? Has too much been cropped? Are there any fingers in the frame? The TIFF files are then checked for content and structure using DPF Manager. The correct TIFF profile is crucial for long-term preservation, and Baseline 6.0 uncompressed was chosen in the GIVE newspaper project.
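
To give an idea of what one such structural check looks at, the sketch below reads the Compression tag from the first image directory of a TIFF file using only the Python standard library. It covers a single property and is in no way a substitute for a conformance checker such as DPF Manager; the function names are our own.

```python
import struct
from typing import Optional

COMPRESSION_TAG = 259  # TIFF tag that holds the compression scheme
UNCOMPRESSED = 1       # value 1 means "no compression" in TIFF 6.0


def tiff_compression(data: bytes) -> Optional[int]:
    """Return the Compression value of the first image, or None if absent."""
    if data[:2] == b"II":
        endian = "<"   # little-endian byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian byte order
    else:
        raise ValueError("not a TIFF file: bad byte-order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file: magic number is not 42")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i  # each directory entry is 12 bytes
        tag, _field_type, _count = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag == COMPRESSION_TAG:
            # A single SHORT sits in the first 2 bytes of the value field
            (value,) = struct.unpack(endian + "H", data[start + 8:start + 10])
            return value
    return None


def is_uncompressed(data: bytes) -> bool:
    """True if the first image is stored without compression."""
    value = tiff_compression(data)
    # TIFF 6.0 defines 1 (no compression) as the default when the tag is missing
    return value is None or value == UNCOMPRESSED
```

For a real file you would pass in the bytes read from disk, for example `is_uncompressed(open(path, "rb").read())`, although for very large master files it is kinder on memory to read only the header and first directory.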

The OCR results (according to the ALTO standard) are validated using an XSD schema. Is the file correctly formatted? Is everything included in the file? How readable is it? A disclaimer is also added to this new content data, as errors can sometimes occur.
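
The readability question can be approached through the word confidence that ALTO stores in the `WC` attribute of each `String` element. The sketch below parses an ALTO fragment with the Python standard library and lists low-confidence words. It only checks that the XML is well-formed; full XSD validation needs a schema-aware library, which is not shown. The sample fragment, the ALTO v3 namespace and the 0.6 threshold are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up ALTO fragment; the ALTO version used in the project is an assumption
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Vooruit" WC="0.98"/>
    <String CONTENT="krant" WC="0.50"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def alto_word_confidences(xml_text: str) -> list:
    """Return (word, confidence) pairs for every String element with a WC value."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    words = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # drop the namespace part
        if local_name == "String" and "WC" in element.attrib:
            words.append((element.attrib.get("CONTENT", ""),
                          float(element.attrib["WC"])))
    return words


words = alto_word_confidences(SAMPLE)
low_confidence = [word for word, wc in words if wc < 0.6]  # assumed threshold
print(words)
print(low_confidence)
```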

The metadata created by Picturae (in METS XML) is finally checked automatically: are the digital packages, and their required specifications, complete? After this, the files can flow into meemoo as a single SIP package, where they are sustainably archived. The newspapers can then gradually be made accessible by meemoo and the partners in this project – so that everyone can eventually enjoy this wealth of information.

Partners

  • Flanders Heritage Library
  • ADVN
  • Amsab
  • Bibliotheek Hasselt Limburg
  • Erfgoedbibliotheek Hendrik Conscience
  • KU Leuven
  • KADOC (KU Leuven)
  • City library (logo labelled ‘stadslogo bib’)
  • Boekentoren

This project was made possible with support from the European Regional Development Fund and is part of the Flemish Government’s Resilience Recovery Plan.
