AI and archives: a good match?

Three AI technologies powering more and richer metadata

Organisations such as museums, heritage societies and universities store vast amounts of precious archival materials. Over recent years, the service organisation meemoo has been meticulously digitising and archiving this wealth of audio and visual content. And with the GIVE project, meemoo is also writing the next chapter: using artificial intelligence to enhance the searchability of this rich source of information. Discover below how the three technologies were put to work.

In the GIVE project, meemoo set about tackling audiovisual archival materials from cultural and government organisations. Over two and a half years, they significantly enriched descriptions from a vast archive using AI technologies: facial recognition, speech recognition and entity recognition. The kind of metadata we are talking about includes, for example, titles, personal names, dates and locations.

A view of the set

From KU Leuven to Opera Ballet Vlaanderen, Liberas and the Poperinge Archive, the GIVE project added descriptive data to audio and visual content for around 130 cultural and governmental partners.

Facial recognition technology delved into 124,000 hours of video content. And speech and entity recognition tools were also applied to this content along with a further 46,000 hours of audio-only content, making more than 170,000 hours of video and audio more searchable in total.

[Infographic: the audio and video content enriched with metadata]

Facial detection and recognition

Let’s start with automated facial recognition. This technology was deployed in the GIVE project to identify public figures in videos with minimal human intervention.

The facial recognition pipeline used was not purchased but built in-house due to our very specific requirements, the relatively high cost of commercially available tools, and experience already gained in the FAME project. Developing a facial recognition pipeline in-house brings many advantages, such as offering more freedom and ensuring control over privacy and the ethical handling of sensitive data.

Facial recognition in GIVE: how does it work?

In the GIVE project, facial recognition was applied exclusively to moving images. The detection and recognition of numerous faces followed this process:

[Diagram: the step-by-step facial recognition workflow]

1. Face detection

Before you can recognise a person, you need to identify which elements in a clip are faces. After extensive benchmarking, meemoo opted for a combination of two open-source models: YuNet and MediaPipe.

[Photos: Josse De Pauw and Fumiyo Ikeda, 1979]
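To make this concrete, here is a minimal sketch of running the two detectors on a single video frame. It is an illustration, not meemoo's production pipeline, and it assumes the opencv-python and mediapipe packages are installed along with a locally downloaded YuNet model file.

```python
# Minimal face detection sketch with the two open-source models named above.
# Assumptions (not meemoo's actual code): opencv-python >= 4.7, mediapipe, and
# a local copy of the YuNet model "face_detection_yunet_2023mar.onnx".
import cv2
import mediapipe as mp

frame = cv2.imread("frame.jpg")          # one frame extracted from a video
height, width = frame.shape[:2]

# YuNet: returns bounding boxes, five facial landmarks and a confidence score
yunet = cv2.FaceDetectorYN.create("face_detection_yunet_2023mar.onnx", "", (width, height))
_, yunet_faces = yunet.detect(frame)     # None when no faces are found

# MediaPipe face detection: a second detector to cross-check the results
with mp.solutions.face_detection.FaceDetection(model_selection=1,
                                               min_detection_confidence=0.5) as detector:
    mp_results = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

print("YuNet found:", 0 if yunet_faces is None else len(yunet_faces), "faces")
print("MediaPipe found:", len(mp_results.detections or []), "faces")
```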

2. A unique ‘fingerprint’ for each face

Next, a unique ‘fingerprint’ (or feature vector) is generated for each detected face using MagFace. These fingerprints allow us to measure how similar two faces are – for example, this unique code ensures tennis legend Kim Clijsters won’t be mistaken for actress Maaike Cafmeyer. This step is also known as face embedding.

For the facial recognition process to be effective, it is crucial that the images of the faces are clear and of high quality. A face that is turned away or hidden under a helmet scores poorly, while a face looking directly at the camera in a good-quality image scores much better.

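The exact MagFace model and thresholds used in GIVE are not detailed here, but comparing two fingerprints essentially comes down to measuring the distance between two embedding vectors. A minimal sketch, in which the 512-dimensional vectors and the 0.6 threshold are purely illustrative assumptions:

```python
# Comparing two face 'fingerprints': cosine similarity between embeddings.
# The vectors below are random stand-ins for real MagFace embeddings, and the
# threshold is illustrative only.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Close to 1.0 for very similar faces, lower for different people."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

fingerprint_clijsters = np.random.rand(512)   # placeholder for a MagFace vector
fingerprint_cafmeyer = np.random.rand(512)

if cosine_similarity(fingerprint_clijsters, fingerprint_cafmeyer) > 0.6:
    print("Probably the same person")
else:
    print("Different people")
```

A notable property of MagFace is that the magnitude of its feature vector also reflects how usable the face image is, which ties in with the quality scores described above.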

3. Tracking faces

With information on where in the video the face was detected, along with the fingerprints, faces can be tracked across multiple frames within the same shot. This process, known as face tracking, aims to group consecutive faces of the same person within a shot.

[Example: Kim Clijsters’s face tracked across consecutive frames within a single shot]
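As an illustration of the idea (not the actual GIVE implementation), face tracking within a shot can be sketched as greedily linking each new detection to the most similar track from the previous frame:

```python
# Face tracking sketch: within one shot, attach each detection to the most
# similar track that was still active in the previous frame. Threshold is illustrative.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def track_faces(frames, threshold=0.6):
    """frames: per frame, a list of (bounding_box, embedding) detections."""
    tracks = []                                   # each track is a list of detections
    for frame_idx, detections in enumerate(frames):
        for bbox, embedding in detections:
            best_track, best_sim = None, threshold
            for track in tracks:
                prev_idx, _, prev_embedding = track[-1]
                sim = cosine(embedding, prev_embedding)
                if prev_idx == frame_idx - 1 and sim > best_sim:
                    best_track, best_sim = track, sim
            if best_track is not None:
                best_track.append((frame_idx, bbox, embedding))   # continue the track
            else:
                tracks.append([(frame_idx, bbox, embedding)])     # start a new track
    return tracks
```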

4. Grouping faces

A person can be followed within a shot, but they may also appear again later in the video. In step four, the various appearances of a person in a video are therefore grouped into a cluster. This combination forms an individual. The system also indicates where in the video the person appears.

For individuals who appear in many shots in a video, the three best faces are selected. Without this pruning step, you would have to keep track of thousands of faces per person – far too much information to manage!
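A sketch of what this grouping step could look like, using agglomerative clustering on one averaged fingerprint per track; the distance threshold and the quality scores are illustrative assumptions, not the project's actual settings:

```python
# Grouping face tracks from different shots into 'individuals' by clustering
# their fingerprints, then keeping the three best faces per individual.
# Assumes scikit-learn >= 1.2 (for the 'metric' parameter); values are illustrative.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

track_embeddings = np.random.rand(20, 512)   # one averaged fingerprint per track
track_quality = np.random.rand(20)           # e.g. a quality score per track

clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=0.8,
                                     metric="cosine", linkage="average")
labels = clustering.fit_predict(track_embeddings)

for person in np.unique(labels):
    members = np.where(labels == person)[0]
    best_three = members[np.argsort(track_quality[members])[::-1][:3]]
    print(f"individual {person}: keep tracks {best_three.tolist()}")
```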

5. Facial recognition! Or not?

In the final step, the three representative faces of a person from a video are compared with a reference list. This is a set of images of public and relevant individuals, compiled by the content’s custodians for the GIVE project.

The actual recognition process works by comparing the unique ‘fingerprint’ of the individual from the video with the ten most similar fingerprints from the reference set (a minimal sketch in code follows after the two scenarios below).

  • If the fingerprints are sufficiently close, and the individual from the video corresponds to a single person from the reference set…
    It’s a match! The detected face is linked to the photos – and thus the person – from the reference list, and face recognition is achieved. At this point, an identity is assigned to the detected face, and you generate new metadata! The probability of the recognition is also calculated: how confident are we of the match?

  • If the distance is too great…
    Then there’s no match. The person in the video is not yet in the reference set and therefore cannot be matched. The story doesn't end here, however, as these unknown individuals are compiled across various videos using their unique ‘fingerprint’. In the future, all organisations involved in the GIVE project can continue to add individuals to the reference set, so the list keeps growing, and more recognised individuals are added.

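Sketched in code, the matching step described above might look like this; the distance threshold and the requirement that all close neighbours agree are simplified assumptions, not the exact GIVE logic:

```python
# Matching a detected individual against the reference set: look up the ten
# nearest reference fingerprints and accept the match only if they are close
# enough and all belong to one person. Threshold values are illustrative.
import numpy as np

def recognise(face_embedding, reference_embeddings, reference_names,
              k=10, max_distance=0.4):
    sims = reference_embeddings @ face_embedding / (
        np.linalg.norm(reference_embeddings, axis=1) * np.linalg.norm(face_embedding))
    distances = 1.0 - sims
    nearest = np.argsort(distances)[:k]
    close_enough = [i for i in nearest if distances[i] <= max_distance]
    if close_enough and len({reference_names[i] for i in close_enough}) == 1:
        confidence = 1.0 - float(distances[close_enough[0]])
        return reference_names[close_enough[0]], confidence   # it's a match!
    return None, 0.0                                          # unknown, for now
```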

A concrete case! Person X appears in 34 videos from 12 different organisations. As soon as one of those organisations names person X as Guy Verhofstadt, all of them receive new metadata for those 34 videos.

Did you know? The face detection phase takes several months. Face recognition, on the other hand, takes a day at most!

A well-stocked reference list for culture and government

As mentioned above, to perform face recognition, you need an extensive list of individuals you actually want to recognise. These are typically public figures or individuals who are highly relevant in the context of a specific archive. But that’s not all: you also need portrait photos of these individuals. And that's quite a task! Fortunately, meemoo was able to build on the FAME research project (2020-2022). Institutions that manage collections – such as the Flemish Parliament, KOERS (Museum of Cycle Racing), ADVN (Archive for National Movements), Kunstenpunt (Flanders Arts Institute), Amsab-ISG (Institute of Social History), KADOC (Documentation and Research Centre on Religion, Culture and Society at KU Leuven) and Liberas (Liberal Archive) – contributed to FAME by collecting reference photos, metadata sources and existing reference sets.

In the GIVE project, the 130 owners of the enriched archive content further supplemented this set – though not without guidelines: to ensure everything was conducted ethically, the processes were discussed in detail with all stakeholders. Ultimately, it is still a human – the archivist – who remains in control. The set is also under shared management, which ensures a rich reference set and, over time, more metadata: an individual is not just recognised within the confines of one archive, but immediately across videos from some 130 archives.

Extra enriching? Where possible, each individual in the set was linked to the content manager’s internal data sources and to an authentic source or thesaurus, such as Wikidata, to provide additional information and therefore enrich the metadata even further.

The set currently comprises about 3,800 individuals, and organisations can continue to add individuals they wish to recognise in the future.

Seen often, but not recognised

[Infographic: facial recognition results]

The results are impressive! A staggering 3.3 million faces were detected, of which 225,000 were matched to a person from the reference set. This translates to 2,518 uniquely recognised individuals – impressive figures and a significant leap forward! And the remaining 3.1 million faces are only anonymous for now: as the reference set grows richer, the chances increase that more of them will be named. A list of the most frequently occurring unknown faces in the video content is also being compiled, which the project partners can use as a foundational document.

Speech recognition

The second AI technology employed to enhance Flemish archival content was speech recognition. Speech within audio clips and videos was automatically transcribed in various languages, so you can quickly grasp what a political debate or documentary is about.


Given that speech recognition technologies are already quite advanced, cost-effective and rapidly evolving, meemoo opted to purchase a service. Unlike facial recognition, meemoo did not develop the speech recognition pipeline in-house, although the chosen system was integrated into the meemoo architecture. But how did this selection process unfold?

A labour-intensive preliminary phase

Arriving at the right choice requires a significant amount of work. The preliminary phase was the most labour-intensive part of the speech recognition aspect within the GIVE project, taking a year and a half to complete.

  • 165 manual transcriptions

    To make an objective comparison of different tools, we compared the results from various existing services with manually transcribed materials. We hand-selected three hours of audio according to certain categories and then had it transcribed by an external agency. These categories were:

    • a radio or television interview

    • a political debate or interview

    • spontaneous commentary, such as for a sports event

    • a reportage or documentary

    • a news bulletin

    • a performing arts clip

    • a clip featuring dialect

    • old archival material (time indication?)

    • archival material in another language

Key terms – such as the names of organisations, individuals and locations – and a set of keywords were also annotated in each clip. We were even able to reuse this data later, when selecting an external service for the entity recognition component.
  • Launching a public tender

    Following this, meemoo launched a public tender process, which involved objectively comparing the quality and price (per hour of transcribed content) of different solutions. Meemoo was then able to make a purchasing decision.

  • A benchmarking tool

    The open-source tool EBU-BenchmarkSTT compared all solutions against each other. It measured the quality of the various speech recognition solutions by comparing them to the reference material (the manual transcriptions) using the so-called word error rate. Other qualitative features and the price also played a role in the final decision.
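Under the hood, the word error rate is an edit distance over words: the number of substitutions, insertions and deletions needed to turn the automatic transcript into the manual reference, divided by the length of the reference. EBU-BenchmarkSTT handles this comparison in practice; the sketch below only illustrates the underlying metric.

```python
# Word error rate: word-level Levenshtein distance divided by reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming table of edit distances between word prefixes
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          substitution)
    return d[len(ref)][len(hyp)] / len(ref)

# one substituted word in a five-word reference -> WER of 0.2 (20%)
print(word_error_rate("de minister neemt het woord", "de minister neem het woord"))
```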

Moving forward with Speechmatics

At the end of this process, meemoo chose the SaaS application (short for Software-as-a-Service) from Speechmatics. This speech recognition service first determines the language being spoken and then converts the speech into text. The tool also detects which word is said when in the video or audio clip, and how confident the software is that the word has been heard correctly. A written transcription with time stamps in several file formats, such as a subtitle file (.srt), can then be generated lightning-fast. The machine transcribes one hour of archival material in a matter of minutes – which would take several hours to do manually. After selecting Speechmatics, the service was integrated as a link in a broader process within the architecture of the meemoo archival system.
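To illustrate the kind of output this step produces, here is a small sketch that turns word-level timestamps into a single subtitle (.srt) cue. The word list is invented, not actual Speechmatics output, and a real transcript would of course be split into many cues.

```python
# From word-level timestamps to one subtitle cue in .srt format.
words = [  # (word, start in seconds, end in seconds, confidence) -- invented example
    ("Goedenavond", 0.0, 0.8, 0.97),
    ("dames", 0.9, 1.2, 0.95),
    ("en", 1.2, 1.3, 0.99),
    ("heren", 1.3, 1.7, 0.96),
]

def srt_time(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm notation used in .srt files."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"

with open("clip.srt", "w", encoding="utf-8") as srt:
    srt.write("1\n")
    srt.write(f"{srt_time(words[0][1])} --> {srt_time(words[-1][2])}\n")
    srt.write(" ".join(word for word, *_ in words) + "\n\n")
```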

Speech recognition is not without its challenges, of course: the AI application often struggles with recognising dialects or words that are no longer in use today, or transcribing segments where multiple languages are present. In the latter case, only the main language is detected, which can lead to some peculiar results!

Speech recognition software is evolving at an incredibly fast rate. We noticed a measurable improvement in the space of just a few months, as well as an additional feature: an automatic language detection step that selects the correct language model for the actual transcription.

Detailed transcriptions in multiple languages

[Infographic: speech recognition results]

Entity recognition

The speech recognition results lead us seamlessly to the next phase: named entity recognition (NER). We applied this software to the AI-generated texts from the speech recognition process described in detail above – texts brimming with ‘entities’: names of public figures, organisations, places and potentially more. All this information was identified and collected in the final part of the GIVE metadata project.

[Infographic: entity recognition results]

For entity recognition, we decided to purchase an application. The process was similar to the selection of the speech recognition service: as described above, meemoo first had three hours of audio manually transcribed in order to make an objective choice. All the key information in those transcripts was then highlighted so it could be used to inform the choice of NER application.

After a thorough comparison of spaCy, FLAIR, mBERT, Amazon Comprehend, Azure Language, Google Cloud Natural Language and Textrazor, the last of these – the SaaS application Textrazor – emerged as the best. In addition to identifying entities such as people, organisations and locations in various languages, the tool also provides links to thesauri, such as Wikidata, where possible – a significant advantage!
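As a taste of what such a tool returns, here is a minimal sketch using spaCy, one of the open-source candidates from the comparison. It assumes the Dutch model nl_core_news_lg has been downloaded; Textrazor itself is a paid API and is not shown here.

```python
# Named entity recognition on a transcript snippet, sketched with spaCy.
# First: pip install spacy && python -m spacy download nl_core_news_lg
import spacy

nlp = spacy.load("nl_core_news_lg")
transcript = ("Guy Verhofstadt sprak in het Vlaams Parlement in Brussel "
              "over de Europese Unie.")

for entity in nlp(transcript).ents:
    print(entity.text, entity.label_)   # people, organisations, places, ...
```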

Authentic sources for more metadata

In the facial recognition and entity recognition components of the project, we used authentic sources wherever possible. These databases are full of genuine information that serves as the reference point: Wikidata, VIAF or Geonames, for example. When you link the newly generated metadata to such an authentic source, you connect to an element that is richer and can contain other interesting metadata, such as a person's date of birth, the political party they were active in, or the precise coordinates of the town where they lived.

Authentic sources also help to make the metadata unambiguous, preventing confusion between aliases or incorrect spellings of a person’s name. Thanks to the additional information, you know it refers to the cyclist Eddy Merckx, for example, and not the billiards player with the same name!
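A minimal sketch of how such a link could be looked up via Wikidata's public search API; this is an illustration, not necessarily how GIVE performs the linking, and the actual disambiguation – cyclist or billiards player – still needs extra context or a human check.

```python
# Searching Wikidata for candidate entities matching a recognised name.
import requests

def wikidata_candidates(name: str, language: str = "nl"):
    response = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbsearchentities", "search": name,
                "language": language, "format": "json"},
        timeout=10,
    )
    return [(hit["id"], hit.get("label", ""), hit.get("description", ""))
            for hit in response.json().get("search", [])]

# Prints the Q-identifiers and descriptions of every Eddy Merckx on Wikidata,
# so the cyclist can be told apart from his namesakes.
for qid, label, description in wikidata_candidates("Eddy Merckx"):
    print(qid, label, description)
```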

What about the future?

Sustainable preservation

First things first: all generated metadata is sustainably stored in meemoo’s archival infrastructure. This storage is subject to several agreements:

  • it is marked as ‘generated by meemoo via AI’

  • the origin is meticulously recorded:
    • At what time?

    • Using what software?

    • In which project and by whom?

Because this new metadata is very detailed, abundant and in turn contains information about the metadata itself, meemoo does not store it in its archival system. Instead, the metadata is stored directly in a knowledge graph, with a link to the item in the archival system. A knowledge graph is a database that represents knowledge as an extensive network of connections and consolidates it in a uniform way.
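As an illustration only – the namespaces and property names below are invented, not meemoo's actual data model – a recognition result and its provenance could be written to a knowledge graph like this:

```python
# One facial recognition result stored as triples, including the provenance
# agreements listed above. Namespaces and properties are illustrative.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("https://example.org/give/")          # placeholder namespace
graph = Graph()

annotation = EX["annotation/123"]
graph.add((annotation, RDF.type, EX.FaceRecognition))
graph.add((annotation, EX.archiveItem, URIRef("https://example.org/archive/item/456")))
graph.add((annotation, EX.identifiedPerson, URIRef("http://www.wikidata.org/entity/Q0000000")))  # placeholder QID
graph.add((annotation, EX.confidence, Literal(0.93, datatype=XSD.double)))
graph.add((annotation, EX.generatedBy, Literal("generated by meemoo via AI")))
graph.add((annotation, EX.software, Literal("GIVE facial recognition pipeline")))
graph.add((annotation, EX.generatedAt, Literal("2024-01-15T10:00:00", datatype=XSD.dateTime)))

print(graph.serialize(format="turtle"))
```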

Access and reuse

The primary focus of the GIVE metadata project is on the creation of metadata. Access and reuse are not immediately on the agenda but will be explored later together with meemoo’s content partners.

Sustainable processes

The processes used in the project will continue to exist in the future. And the archives of several major media companies will also be addressed! In the Shared AI project, personal names, entities and transcriptions will be added to the archives of VRT and all Flemish regional broadcasters – based on roughly the same workflows as in the GIVE project.

But the story doesn’t end here. Organisations that manage collections can continue to use their expertise to better describe archival content, and the world of artificial intelligence is constantly evolving: new technologies keep emerging and their quality keeps improving. Artificial intelligence holds great promise for annotating large volumes of data, and meemoo is eager to make the most of it!
