A new skill for a new age
By Andrew Lambourne (SysMedia Ltd (UK))
Abstract & Keywords
Subtitle production by respeaking has only existed since 2001, but in these few years the tools and professional skills involved have developed significantly. This article provides a brief overview of the techniques used and of the main concerns initially raised by this new subtitling tool. It then reviews the technologies available and the results achieved. Numerous services around the world already have dedicated software and experienced operators, but for many languages this service is not yet available, or its quality is inadequate. The article examines what can be done, and what is being done, on this front, and finally looks ahead to a future in which natural language technologies are expected to be used increasingly for subtitling, and in which the respeaker may also begin to work in other areas of text processing.
Subtitle respeaking has been in use since 2001, and over the past five years the skills and tools have developed significantly. This article provides a quick review of the technique as it has evolved, with an outline of the main concerns at the beginning of the respeaking revolution. It will also review the tools and engines that are available, and the results that are being achieved. Many live subtitling services around the world rely heavily on the skills of the respeaker and the capabilities of the software that is available today. Yet it is not possible to provide this service to the same level of quality – or indeed at all – in many languages. What can be done to address this, and what is being done? This article also takes a look at the future, where speech tools are integrated more and more with subtitle production and the respeaker forms an important part not just of the subtitling landscape but perhaps in other text-processing environments as well.
Keywords: television, televisione, live subtitling, respeaking, teletext, multimedia translation, traduzione multimediale, interactive tv, deaf, sottotitoli televisivi per i sordi, respeakeraggio, tv interattiva
©inTRAlinea & Andrew Lambourne (2006).
inTRAlinea Special Issue: Respeaking
Edited by: Carlo Eugeni & Gabriele Mack
This article can be freely reproduced under Creative Commons License.
Permanent URL: http://www.intralinea.org/specials/article/1686
1. Subtitling for the deaf: challenges and requirements
The art of producing a written simultaneous language translation by dictating a translated version of a live speech is not new, and provides an invaluable means of access to meetings and conferences where delegates have different native tongues. Indeed, the United Nations and the EU Parliament would not function without it. Now a new skill is increasingly in demand: the ability to dictate live subtitles to a speech recogniser in order to create prompt and accurate subtitle texts for hearing-impaired viewers of live television programmes.
Language translation (or interlingual) subtitling has been in use since the 1930s, but it was not until the 1970s that the possibility of providing subtitles for deaf and hard-of-hearing viewers was taken seriously. These subtitles would be intralingual (i.e. in the same language as the programme), but supplemented by texts conveying not just dialogue but sound effects as well. Subtitling for the deaf is audiovisual translation: conveying pictures and sound using pictures and text. Research in the US and the UK, as well as elsewhere in Europe, focused on guidelines for reading rate, presentation style, content and technique.
The research showed that the hearing impaired audience split into two main groups: those whose first language was the mother tongue but who had lost their hearing after acquiring spoken language skills, and those who were born deaf or deafened before acquiring language and whose first language was sign. The second group was a minority of a minority: in the UK a few hundred thousand people from a hearing impaired population of some 3-4 million. Naturally the first ‘hard-of-hearing’ group preferred subtitled texts: they could read them easily, and they used them to supplement their hearing of what was being spoken. For this reason most subtitling guidelines recommended staying close to the original spoken dialogue and editing by deletion rather than by rewording. The second ‘deaf’ group preferred sign language, and were less keen on texts because they often had less well developed reading skills, as sign language was their first and preferred language, with its modified grammatical structure.
2. New technologies, new options
The advent of the teletext and Line21 systems in the mid 1970s made it possible to provide optional subtitles for the hearing impaired, and broadcasters chose to focus their efforts on the majority hard-of-hearing group even though the technology made it possible to provide alternative, perhaps more highly edited or grammatically modified, texts for the prelingually deaf. In-vision signing was provided for special programmes, but rarely for mainstream programmes for fear of annoying the majority hearing audience for the sake of a very small minority. Much work has been done to explore the possibility of optional signing, but generally this is found to be less satisfactory than in-vision sign produced by filming a signer.
Hearing-impaired people want to watch and enjoy any TV programme, just like everybody else, rather than being limited to a subset. However, as subtitling services emerged, they tended at first to be restricted to the “easier” material: recorded programmes available well in advance of transmission. A subtitler could then spend 10 or 20 hours preparing and rehearsing texts, which produced very good quality results.
3. Systems for live subtitling
3.1 The Stenograph method
For programmes not recorded well in advance – for example live news or sport, live chat shows, or reality TV – it was less easy or even impossible to provide subtitles. In the US there was an available pool of Stenographers who provided court reporting services, and they were used to provide very fast (more or less verbatim) closed captions at 180-200 words per minute (wpm) for live TV shows. The US Line21 technology supports a smoothly scrolling display designed to make these subtitles slightly easier to read, whereas the teletext equivalent is a “jumping” scroll. In either case, verbatim reporting imposes a punishing reading load on the viewer. Nevertheless, captioning services were provided for live TV, and the Stenograph method is still used in the US and, to a lesser extent, by some UK and Australian broadcasters.
3.2 Stenotyping methods
In the UK, which lacked a pool of Stenographers, Independent Television, working with the author, started by using a normal QWERTY keyboard with abbreviation codes that expanded automatically, in order to provide summary headline subtitles for major public events such as the visit of the Pope in 1982, World Cup contests and Royal weddings. This method was successful but not fast enough for news. In 1987, therefore, a Dutch syllabic keyboard called Velotype was introduced, which could be used at around 90-110 words per minute after 12 months of training. With two keyboard operators working in tandem and sharing the typing load, the system allowed the production of subtitles at 120-160 words per minute for news and current affairs programmes, which in the UK are normally spoken at 180 words per minute. A regular live news subtitling service using Velotype was set up for ITV in 1987: the operators initially relied on a respeaker or ‘parrot’ who listened to the programme and dictated an edited text to the two keyboard operators (the ‘SUBMUX’ method), but they eventually became quite capable of listening, editing and typing themselves. The SysMedia WinCAPS live subtitling system still provides this option today.
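The abbreviation-expansion technique described above can be sketched in a few lines. This is a minimal illustration, assuming simple word-boundary triggering; the codes shown are hypothetical examples, not ITV's actual abbreviation set.

```python
# Sketch of keyboard shortform expansion: each whole-word code
# the operator types is replaced by its full expansion.
# The codes below are hypothetical examples.
SHORTFORMS = {
    "wc": "World Cup",
    "hm": "Her Majesty",
}

def expand(text: str) -> str:
    """Replace any whole-word shortform code with its expansion."""
    return " ".join(SHORTFORMS.get(word, word) for word in text.split())

print(expand("hm arrives as the wc begins"))
# -> "Her Majesty arrives as the World Cup begins"
```

A real system would expand codes keystroke by keystroke rather than on a finished string, but the lookup principle is the same.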
3.3 Speech recognition
By 1998 it was clear that speech recognition technology should be taken seriously as a means of transcribing text, and SysMedia started researching what would be required to use speech input for subtitling. At first, accuracy was not quite sufficient – around 90% – so two people were used: one to speak, and one to rapidly correct the subtitles before transmission. By the time the system had been perfected, however, recognition accuracy in English had risen to 95% and the corrector was in fact not needed. Services were set up using the WinCAPS SpeakTitle system by the BBC in 2001, IMS in the UK in 2002 and NOS in Holland in 2004, and it is now also used in Denmark, New Zealand, Australia and Germany. Trials have been conducted by SysMedia customers in other countries: for example, IMS has performed tests with three broadcasters in Italy.
Speech input has now taken its place as one of the standard methods available for real-time speech transcription. Each method has different costs, speeds, accuracy levels and language availability, and places different demands on the technology and the operators, as the synopsis in Table 1 shows.
It is for the service provider to choose the method that suits them best, and often a combination of methods is available and deployed for different types of programmes.
In all cases, one thing is always true: creating and delivering subtitles in real time for a truly live unscripted TV programme or meeting inevitably and unavoidably involves a degree of compromise: perfection is not achievable almost by definition because there is simply not enough time in which to:
- carefully edit the text;
- transcribe it and correct any errors;
- time it to match what the speaker is saying;
- position it so as never to obscure interesting on-screen information;
- present it in a way that the viewer always has time to read it.
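The last constraint can be made concrete: a target reading rate implies a minimum on-screen duration for each subtitle. A minimal sketch, assuming an illustrative 180 wpm reading rate (actual guideline rates vary by service and audience):

```python
def min_display_seconds(subtitle: str, reading_rate_wpm: float = 180.0) -> float:
    """Minimum time a subtitle should stay on screen so that a viewer
    reading at reading_rate_wpm words per minute can finish it."""
    words = len(subtitle.split())
    return words / reading_rate_wpm * 60.0

# A 12-word subtitle at 180 wpm needs at least 4 seconds on screen.
print(min_display_seconds("the quick brown fox jumps over the lazy dog near the barn"))
```

In live subtitling this duration competes directly with the need to keep up with the speaker, which is why editing down the text is often unavoidable.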
Both the broadcaster and the audience need to understand and accept this, and then move forward pragmatically to deliver services in which compromises are kept to a sensible minimum and quality is preserved at a sensible level. Having a service that is not quite perfect is far better than not having one at all. The temptation always exists to renounce providing a reasonable service on the grounds that it might not be quite as perfect as subtitles for recorded programmes. This is simply an excuse and a delaying tactic, and should be exposed as such. The far better approach is to educate the audience in the challenges involved, and to explain the methods being used to overcome the many difficulties connected with this operation.
- Dual QWERTY + shortforms: up to 120-150 wpm when combined; 6 months training; accuracy can be 95-98%.
- Velotype syllabic keyboard: one or two operators; up to 140-180 wpm when combined; 12 months training; accuracy can be 95-98%; available in Dutch, Swedish and English.
- Speech recognition (respeaking): one operator (if not correcting); up to 140-160 wpm; 2-3 months training; accuracy can be 95-98% depending on language; available in English, French, Italian, German, Spanish, Dutch and Danish.
- Stenograph: one (expensive) operator; up to 220 wpm; more than 2 years training; accuracy can be 97-98%; available in English, Spanish, French and Italian.

Table 1: Methods for real-time speech transcription and their main characteristics
4. Challenges for real time subtitling
At present the main challenges for real time subtitling, especially for unscripted speech, are the following:
- composing adequate subtitle text for unscripted material as it is spoken (this touches on grammar, editing, time delay, factual accuracy, typographical accuracy, completeness);
- transcribing the composed texts (using one of the different technologies and techniques available, each with pros and cons);
- presenting the texts so that they can easily be read (this touches on style of presentation – e.g. scrolling or block – as well as timing);
- positioning the texts so as not to obscure relevant visual information (the commentator may be speaking really fast just at the most interesting visual point – the goal or the end of the race);
- avoiding delay to the programme broadcast (only in some cases can a live programme be aired a few seconds late to allow more time for subtitling).
The mission of SysMedia and other companies operating in this field is to assist broadcasters to overcome these challenges in the best possible way for the kinds of services they wish to provide. For example, a news programme will often contain a majority of pre-scripted material, some videotape packages, and an amount of truly live reportage or interview. WinCAPS enables the scripts to be downloaded, tidied up with respect to spelling errors and used to deliver good quality tightly synchronised subtitles when the presenter is reading scripted text. For the videotape items that can be accessed and transcribed before (or even during) transmission the subtitles can be cued out manually or even pre-timed. That leaves just the truly live material, which can be transcribed using one of the methods described.
For a live political talk-show, the challenges are even greater, and the method of choice may involve sharing the load between more than one operator. For example, one respeaker may focus on one speaker and a second on another speaker to make the workload more manageable. Often an editorial decision to ‘round down’ what is said is sensible, inevitable and pragmatic. If there is dissent regarding the transmission of a ‘less than 100% accurate’ service, the speakers and the broadcasters can be invited to decide whether they want the majority of what they say to be conveyed to the potentially 10% additional hearing-impaired viewers, or none of it.
5. Qualification requirements and selection criteria in real time subtitling
For those providing real time subtitling services, be they respeakers, Velotypists, or Stenographers, the challenges are very similar; in fact they are similar also to the questions a simultaneous interpreter asks during his or her work: “What is this person going to be talking about? Am I familiar with the vocabulary and how to spell/translate it? Can I keep up with what they are saying – and if not, what shall I do? Can I convey what is being said in a way that makes sense?” or even “Can I convert what is being said into something that makes sense?”
There is certainly a need for advance preparation – and a great benefit in doing it. Regardless of whether an operator uses typing, Velotyping, Stenography or respeaking, s/he needs to know which specialised vocabulary or proper nouns are likely to come up, and build a shortform list, dictionary or language model accordingly. Preliminary documentation and preparation are a great boost to quality. Of course if a speaker suddenly goes off-message and starts to talk about a completely unexpected subject, the translator, respeaker or stenographer may encounter some trouble – but after all, that is what makes the job interesting! Rather than trying to faithfully report every single word, they may settle for a more general abstract of it.
SysMedia recommends that respeakers take four main steps in order to produce real time subtitles:
- learn how to speak so as to get the best possible accuracy levels;
- train the speech recogniser to their voice and their speech commands;
- research the programme to be subtitled and add any new vocabulary or house style;
- practise a little before going on air with the WinCAPS SpeakTitle system.
The system copes with converting the text into subtitles, decoding speech commands to control colour and style, applying House Style rules to tidy the presentation and correct recognition errors, and timing the delivery of the subtitles. Users can choose to present the output in a two-line scrolling format where each new word is added to the right-hand end of the bottom row. This minimises the delay between a word being spoken and seeing it on screen.
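The two-line scrolling presentation described above can be sketched as follows. This is a simplified illustration, not WinCAPS code, and it assumes a 37-character usable row width:

```python
MAX_CHARS = 37  # assumed usable width of one subtitle row

def scroll_word(rows: list[str], word: str) -> list[str]:
    """Append a word to the bottom row of a two-line scrolling display;
    when the bottom row is full, it scrolls up and the word starts a new row."""
    top, bottom = rows
    candidate = (bottom + " " + word).strip()
    if len(candidate) <= MAX_CHARS:
        return [top, candidate]
    return [bottom, word]  # bottom row moves up; word opens a new bottom row

rows = ["", ""]
for w in "each new word is added to the right hand end".split():
    rows = scroll_word(rows, w)
print(rows)
# -> ['each new word is added to the right', 'hand end']
```

Because each word appears as soon as it is recognised, the viewer sees text with minimal lag, at the cost of a display that is constantly in motion.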
Respeakers are selected and trained on their ability to get good results from speech recognition systems. They need to understand how the system is working and what it needs: essentially clearly enunciated text which is spoken consistently and at an even pace. On top of this they need the ability to make accurate editorial decisions regarding what to leave in and what can be taken out if the text has to be slightly edited down to shorten the subtitles and thus reduce the reading speed required. The result must be factually accurate, comprehensible, and cover all the key points.
As far as the results are concerned, with English recognition accuracy levels of 97-98% are possible, at speeds of around 140 wpm. The most common types of errors are small words left out, small words appearing in error, and occasional wrong words being shown. Speech recognisers do not make spelling errors but they may confuse homophones or deliver incorrect grammar.
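Accuracy figures like these are conventionally derived from the word-level edit distance between the recognised output and a reference transcript. A minimal sketch of such a word-accuracy measure (the complement of word error rate); this is an illustration, not the exact metric any particular broadcaster uses:

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy = 1 - WER, via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return 1.0 - dp[-1][-1] / len(ref)

print(word_accuracy("the cat sat on the mat", "the cat sat on a mat"))
# one substitution in six words -> roughly 0.833
```

Note that this measure counts the small omissions, insertions and homophone confusions described above all as single-word errors, which is why a 97-98% figure can still leave several visible errors per minute at 140 wpm.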
Simultaneous correction can be used to tidy up the text before it is delivered, but there are two main problems with this. First of all, there is an inevitable penalty in the delay to the text. Secondly, the person correcting must listen to what is being respoken as well as working out where errors have occurred and how to correct them most quickly, and then make the corrections. It is a very demanding job which only few people can do for long periods, and which will add at best a few percent accuracy, at the expense of added costs and delay. For this reason, in the UK, such an approach is not considered necessary.
At the end of the day, respeaking is yet one more tool – albeit a rather new one – in the repertoire of techniques that subtitlers can use for dealing with transcription. This applies to transcription in the general sense, because it does not have to be used just for live programmes. Speech input is now firmly on the list of methods that can be used to transcribe a pre-recorded TV programme in the absence of an accurate script: it may be faster than keyboard input, it is certainly less of a strain, and its accuracy in some languages is now near enough to 100%, so that the need for correction is minimal. It can also be significantly less expensive.
Further progress is needed in spreading the technique of respeaking across the countries and territories where a commitment is being made to extending live subtitling services – and indeed offline subtitling of pre-recorded programmes. The most urgent need is for improved recognition engines for languages where either engines exist but struggle (e.g. French and German) or no engine exists (Norwegian, Swedish, Hebrew). SysMedia is seeking to promote the development of new speech recognition language engines, and was involved in a project in Denmark working with the Philips Speech Magic system to create a new Danish recogniser.
Other technology advances include the possibility of tele-working: easy-to-operate laptop-based subtitling systems like WinCAPS, coupled with wide area network distribution of programme audio, open up the possibility of speech-based homeworkers providing subtitles even for a live programme without having to visit the broadcast centre.
As services and skills develop and confidence grows, speech subtitlers are tackling more and more demanding material. Sports programming was chosen as an easy starting point because of its slower pace (in some cases) and the possibility of providing a less tightly coupled subtitle commentary different from the spoken commentary. News and current affairs are now routinely handled, though sensitive political talk-shows may prove to be the most challenging for any form of subtitling. The normal rules still apply: if there are three people all talking at once, it is almost impossible to subtitle by any method.
As technology advances, the recognition engines which seek to understand speech without specific training from the speaker – i.e. the so-called speaker independent systems – will improve. Currently they are only around 60%-70% accurate under typical TV conditions, but the day may come when such systems are good enough to create very good first-pass offline subtitles. The author does not see the human element being eliminated for a good many years yet, and in the meantime we can look forward to a growth industry stimulated by Europe-wide access services legislation. Now is the time to start lobbying governments with the good news that extensive live subtitling is possible in territories where good speech recognition engines exist, and would certainly be possible if such engines were sponsored in territories where they do not exist. Demonstrating that the tools and techniques are available or can be produced is a crucial step in defeating the argument that live subtitling is too hard. It is not – and SysMedia and its customers have proved it.