SPEECH RECOGNITION
Archive metadata summarises the most important elements of a programme. Speech recognition, on the other hand, provides a record of every word spoken. As such, it can help to find programmes that would otherwise be missed, and makes it possible to analyse the texts with data science methods, for example to see which words are used most frequently. The Netherlands Institute for Sound and Vision is currently processing archive material with a speech recogniser to produce transcripts of the Dutch speech in the programmes. On this website, statistics are presented on the availability of speech recognition transcripts for radio and television material. This page discusses the background of speech recognition, while the Speech recognition Radio & TV page shows an overview of availability.
For questions: rordelman@beeldengeluid.nl or mwigham@beeldengeluid.nl, or for Clariah users, go to http://mediasuite.clariah.nl/contact
Clariah users can search in the speech transcripts in the Media Suite
Frequently asked questions
What is ASR? Automatic Speech Recognition (ASR) is technology that recognises which words are spoken in an audio fragment. If it has been trained for a specific voice, it can be very accurate. In our material, however, there are many voices present, together with other noises. As a result, errors can sometimes occur, e.g. "De werden verdachte omstandigheden en gezien waarna hij direct een onderzoek zijn gestart". Even so, the quality is quite good: "Ooggetuigen meldden dat er is geschoten maar door wie is onduidelijk volgens het persbureau reuters komen betrokkenen oorspronkelijk uit centraal azië een politiebron zegt dat de ruzie uitbrak tussen groepen mensen die bij de begraafplaats spullen verkopen."
Why ASR? There is more chance of finding a relevant programme when we have a transcript of the spoken text. You can search in the metadata that describes a programme - genre, summary, title etc. In this way, you can find a lot of results. But the metadata can never describe the entire content of a broadcast. You can find much more with a transcript of the audio. For example, when a politician made a particular statement, or when a given person is mentioned during a programme. In addition to this, ASR makes it possible to not only find material, but to analyse it too. This is very useful for research. For example, how many emotionally charged words are used during a debate? Do we use different words now when discussing immigration than in the past?
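As a rough illustration of this kind of analysis, the Python sketch below counts word frequencies in a transcript and tallies how often words from a chosen list occur. The function names and the word list are hypothetical, and the sample text is a short piece of the ASR output quoted earlier on this page. It is only a sketch of the idea, not the way transcripts are analysed in the Media Suite.

```python
import re
from collections import Counter


def word_frequencies(transcript):
    """Count how often each word occurs, ignoring case and punctuation."""
    words = re.findall(r"\w+", transcript.lower())
    return Counter(words)


def count_listed_words(transcript, word_list):
    """Total number of occurrences of any word in word_list."""
    freqs = word_frequencies(transcript)
    wanted = {word.lower() for word in word_list}
    return sum(freqs[word] for word in wanted)


# Sample ASR output (part of the example quoted earlier on this page).
sample = "Ooggetuigen meldden dat er is geschoten maar door wie is onduidelijk"

# Hypothetical word list, for illustration only.
words_of_interest = ["geschoten", "onduidelijk"]

print(word_frequencies(sample).most_common(1))        # [('is', 2)]
print(count_listed_words(sample, words_of_interest))  # 2
```

A research question such as the use of emotionally charged words in a debate could follow the same pattern, with a suitable word list and the transcripts of the debates in question.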
Why not just use 888 (the teletext subtitles)? Subtitles are subject to intellectual property laws, which limit access to them and how they can be used. We are working on obtaining access to this information. However, not all programmes are subtitled, and subtitles are only available from 2011 onwards. So ASR will always remain a useful source of information.
How does the speech recognition process work? We search our archive and select material according to its priority (for example, news and current affairs are high priority). The related audio must already be digitally available. This audio is copied to a server where the recogniser is running. Recognising so much material is a lengthy process that requires a lot of processing power. The recognised words are stored as text and made available to researchers in the Media Suite database.
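The selection step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Programme` class, the genre names and the priority values are hypothetical and do not describe the actual archive systems.

```python
from dataclasses import dataclass


@dataclass
class Programme:
    title: str
    genre: str
    audio_is_digital: bool


# Hypothetical priorities: a lower number is processed earlier.
GENRE_PRIORITY = {"news": 0, "current affairs": 0}
DEFAULT_PRIORITY = 1


def select_for_recognition(programmes):
    """Return programmes with digital audio, highest priority first."""
    eligible = [p for p in programmes if p.audio_is_digital]
    return sorted(
        eligible,
        key=lambda p: GENRE_PRIORITY.get(p.genre, DEFAULT_PRIORITY),
    )


archive = [
    Programme("Quiz night", "entertainment", True),
    Programme("Evening news", "news", True),
    Programme("Old documentary", "documentary", False),
]

# Prints "Evening news", then "Quiz night"; the documentary is skipped
# because its audio is not yet digitally available.
for programme in select_for_recognition(archive):
    print(programme.title)
```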
How can I search the transcripts? You must have a log-in for the Media Suite. Here you can search in the programme metadata (summary, genre, participants etc.) and also in the ASR transcripts. Researchers and students at Dutch universities can log on via their user account at their institution.
How many hours of material still need to be recognised? The statistics on this site show how many programmes have been recognised, but not how many hours of material. A programme can be a few minutes, a few hours, or even (for extreme examples such as radio marathons) more than a day. We have a record of how many hours have been processed, but we do not know how many hours are still waiting, as information about the programme length is only available for about 60% of the material in the archive.
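The short Python sketch below illustrates why the remaining hours cannot simply be added up. The function name and the example durations are hypothetical. A programme whose length is unknown is represented by `None`, so the total of the known lengths is only a lower bound.

```python
def summarise_backlog(durations_in_hours):
    """Summarise waiting programmes; None marks an unknown length."""
    known = [d for d in durations_in_hours if d is not None]
    total = len(durations_in_hours)
    return {
        "programmes": total,
        "known_hours": sum(known),
        "share_with_known_length": len(known) / total if total else 0.0,
    }


# Hypothetical backlog: a short item, a long one, a radio marathon,
# and two programmes without length information.
backlog = [0.5, 2.0, None, 26.0, None]

summary = summarise_backlog(backlog)
print(summary["programmes"])               # 5
print(summary["known_hours"])              # 28.5
print(summary["share_with_known_length"])  # 0.6
```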
Is ASR the same as speaker-labelling? No. Speaker-labelling is also an automatic process that analyses the audio, but it shows who is speaking, whereas ASR shows what they said. So when Rik van de Westelaken says, 'Welkom naar Eenvandaag. Vanavond in de uitzending spreken we met Mark Rutten', then speaker-labelling recognises that Rik van de Westelaken is speaking, while ASR recognises the text. So, you can find this fragment by searching for 'Rik van de Westelaken' in the speaker-labelling, or by searching for 'Mark Rutten' in the ASR. Sound and Vision uses a commercial service to label speakers in content as it is ingested into the archive.
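The difference between the two kinds of search can be sketched in Python. The `Fragment` class and the two search functions are hypothetical and only illustrate the idea; they do not reflect how the Media Suite or the speaker-labelling service stores its data.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    speaker: str  # from speaker-labelling: who is speaking
    text: str     # from ASR: what is said


def search_by_speaker(fragments, name):
    """Find fragments in which the named person is speaking."""
    return [f for f in fragments if name.lower() in f.speaker.lower()]


def search_by_text(fragments, phrase):
    """Find fragments in which the phrase is spoken."""
    return [f for f in fragments if phrase.lower() in f.text.lower()]


fragments = [
    Fragment(
        speaker="Rik van de Westelaken",
        text="Welkom naar Eenvandaag. Vanavond in de uitzending "
             "spreken we met Mark Rutten",
    )
]

# Both searches return the same fragment, via different information.
by_speaker = search_by_speaker(fragments, "Rik van de Westelaken")
by_text = search_by_text(fragments, "Mark Rutten")
print(by_speaker == by_text)  # True
```

Searching for 'Mark Rutten' in the speaker labels returns nothing in this sketch, because he is mentioned in the fragment but is not the one speaking.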
What about other languages? At present, we only process the audio with a Dutch speech recogniser. Speech in other languages is still processed, but it is incorrectly recognised as Dutch words.
What about privacy? We carry out speech recognition in accordance with the Dutch privacy laws (AVG). Transcripts are stored carefully and are only available to researchers and Sound and Vision staff.