Interlingual respeaking and simultaneous interpreting in a conference setting: a comparison

By Annalisa Sandrelli (Università degli Studi Internazionali di Roma-UNINT, Italy)

Abstract

In recent years respeaking has become the preferred method for live intralingual subtitling; based on speaker-dependent speech recognition technology, it is used to subtitle live TV broadcasts and events for the deaf and hard-of-hearing. The interlingual variant of respeaking is beginning to emerge as a translation mode that can provide accessibility to all, across both linguistic and sensory barriers. There are striking similarities between interlingual respeaking and simultaneous interpreting in terms of process; however, the two modes differ greatly in terms of end product, i.e. a set of subtitles vs. an oral translation. This empirical study analysed simultaneous interpreting and interlingual respeaking (from English into Italian) in the same conference, to compare the semantic content conveyed to the audience via the interpreted speeches and the live subtitles. Results indicate greater semantic loss in the subtitles (especially through omissions), but little difference in the frequency of errors causing meaning distortion. Some suggestions for future research and training are provided in the conclusions.

Keywords: interlingual respeaking, simultaneous interpreting, conference setting, multimedia transcription, text reduction

©inTRAlinea & Annalisa Sandrelli (2020).
"Interlingual respeaking and simultaneous interpreting in a conference setting: a comparison"
inTRAlinea Special Issue: Technology in Interpreter Education and Practice
Edited by: Nicoletta Spinolo & Amalia Amato
This article can be freely reproduced under Creative Commons License.
Stable URL: http://www.intralinea.org/specials/article/2518

1. Introduction

Respeaking is a relatively recent technique developed for the intralingual real-time subtitling of TV broadcasts (talk shows, weather forecasts, sports programmes, the news, etc.) and live events (conferences, ceremonies, debates, meetings, and so on). It is based on speaker-dependent speech recognition technology and requires a trained professional, called a respeaker, who

[…] listens to the original sound of a (live) programme or event and respeaks it, including punctuation marks and some specific features for the deaf and hard-of-hearing (DHOH) audience, to a speech recognition software, which turns the recognised utterances into subtitles displayed on the screen with the shortest possible delay. (Romero-Fresco 2011: 1)

Unlike pre-recorded subtitles, live subtitles are displayed either as scrolling continuous text or in blocks of text, depending on the setting and the speech recognition software.[1] There are some spatial constraints related to screen size and, depending on the chosen set-up, the real-time text may be visible to the entire audience or only to the users of the subtitling service. On TV, live subtitles may be integrated into the images broadcast to all the viewers (open captions) or activated by users via a Teletext page (closed captions). In conferences, live text may be beamed directly onto the screen used for slides, projected onto a second screen below, above or to the side of the main one, or relayed to users’ personal devices (smartphones, tablets or laptops) via a network connection.  

Speed is key in live subtitling via respeaking; the original speaker’s speech rate (OSR), the target audience’s reading rate, and the respeaker’s speech rate all play an important role. Respeakers tend to lag behind the original speaker, not only because they must understand and process the incoming message before producing their own speech, but also because they have to add punctuation orally. Studies have shown that, if the OSR is up to 180 words per minute (wpm), the respeaker’s lag is between 0 and 20 words, and that it increases further at higher speeds (Romero-Fresco 2009, 2011). Excessive subtitle latency can become a problem in conferences: when speakers use slides, they tend to illustrate one slide and then move on to the next, and there is a risk that a subtitle might be displayed after the related slide has already disappeared. Therefore, respeakers must try to strike a balance between the need to reproduce the original speaker’s message and the objective constraints that characterise live subtitling via respeaking.

Despite its complexity, over the last 15 years respeaking has become the preferred method to produce live subtitles in many countries. The proliferation of TV channels (both satellite and digital terrestrial ones) and the growth of audiovisual content on the Web have increased demand for accessibility services; respeaking is very often the method of choice to ensure access to culture, entertainment and information. In addition, “[a]s societies have become more linguistically diverse, there has also been growing demand for interlingual live subtitling to make programmes accessible for speakers of other languages” (Romero-Fresco and Pöchhacker 2017: 150). Interlingual respeaking (henceforth, IRSP) adds the translation element to respeaking and is essentially a hybrid translation mode:

With regard to the process, ‘interlingual respeaking’ […] is really a form of simultaneous interpreting, while the product […] is a set of subtitles. (Romero-Fresco and Pöchhacker 2017: 158)

As IRSP is a very recent development, there is not much research to determine its viability. This paper aims to contribute to the discussion by presenting a small-scale empirical study based on an MA thesis (Luppino 2016-17) that compared the target language (henceforth, TL) speeches produced by simultaneous interpreters with the subtitles produced by interlingual respeakers working at the same conference. A multimedia data archive was created; then, a smaller sub-corpus of 4 speeches was selected for the study. The focus was on assessing how much of the semantic content of the source language (henceforth, SL) speeches was conveyed to the audience via the interpreted speeches and the subtitles. A dedicated analysis grid was developed and applied to our data to shed some light on the challenges posed by the two modes and to inform future IRSP research and training. The paper begins with a brief overview of respeaking research, with a special focus on IRSP (§2); it then presents the data and methodology in §3, the analysis in §4 and some conclusions in §5.

2. A brief overview of research on interlingual respeaking (IRSP)

As was mentioned in §1, there is relatively little research on respeaking. Starting with the intralingual variant, the focus of the available studies is either on the process or the product, and they are either experimental or empirical. The earliest available studies discussed the similarities between respeaking and simultaneous interpreting (Marsh 2004, Eugeni 2008, Russello 2008-09). More recent studies have tried to pinpoint what makes a good respeaker, i.e. to identify the skills and competences needed to perform this complex task and to determine whether a specific training background can facilitate the acquisition of respeaking skills (Moores 2017, Remael and Robert 2018, Szarkowska et al. 2018).

From the point of view of the end product, respeaking is studied as a form of (live) subtitling, with the related change in semiotic code (from spoken to written) and the need for text reduction connected to the speed constraint (Romero-Fresco 2009, Van Waes et al. 2013, Sandrelli 2013). The main focus of the product-oriented studies has been the development of models to assess subtitle accuracy and the analysis of the specific challenges posed by different settings and text types (Eugeni 2009, Romero-Fresco 2011, Sandrelli 2013). The NER model (Romero-Fresco 2011) is the most widely used model for assessing the accuracy of live subtitles produced via respeaking.[2] It distinguishes between (software-related) recognition errors and (human) edition errors, and a score is attributed to each error depending on its severity (minor, standard or serious). After testing the NER model on different TV genres, a score of 98 per cent has been suggested as the minimum accuracy threshold for usable intralingual subtitles (Romero-Fresco 2011). The model has been adopted by Ofcom, the UK broadcasting regulator, which commissioned four reports on the quality of live subtitling on British television (Ofcom 2015a, 2015b). Most of the available research on intralingual respeaking has been conducted in TV settings, while the Respeaking at Live Events project (Moores 2018, 2020) is looking at the feasibility of respeaking in museum tours, conferences, lectures and Q&A panels after cinema screenings and theatre shows. The aim is to identify the specific requirements of each setting and produce best practice guidelines to organise services efficiently.

Turning to interlingual respeaking (IRSP), an interlingual respeaker needs to have good interpreting skills to translate the source language speech into the target language, but also needs to be able to use the speech recognition software efficiently, to add oral punctuation, to monitor the output to correct any mistakes, and to coordinate all of those efforts in real time. In addition, respeakers working in live events (such as conferences) need to take into account the multimodal nature of the SL material, which may include not only speeches but also slides, video clips or other visual information. Unsurprisingly, one of the key issues being investigated is whether a background in interpreting or subtitling may facilitate the acquisition of IRSP skills. An interdisciplinary group of scholars based in Poland carried out an interesting experiment on interpreters, translators and a control group of bilinguals; all the participants performed intralingual and interlingual respeaking tasks for the first time, working with a range of video clips with different characteristics (genre, speech rate, number of speakers, and degree of scriptedness). Participants’ eye movements and brain activity were analysed by means of an eye-tracker and an EEG device. The accuracy of the subtitles was assessed via the NER score and by three independent raters who applied specific guidelines. The rich data generated by this experimental set-up have been analysed in several publications; here only the conclusions in relation to IRSP are briefly summarised. Szarkowska et al. (2016, 2017) analysed cognitive load and EEG peaks during the tasks in order to detect respeaking crisis points. 
They found that respeaking difficulties can be triggered by many different factors, including very slow and very fast speech rates, overlapping speakers, figures and proper nouns, and complex syntax or word play; in IRSP translation difficulties (for example involving idiomatic expressions) were also found to play a major role. Chmiel et al. (2017) analysed ear-voice span (EVS) and pauses in both intralingual and interlingual respeaking. They found that EVS tends to be much longer in IRSP, with the longest EVS being found in the respeaking of the news (scripted, with a high information density and a high OSR); moreover, pauses were longer in IRSP than in intralingual respeaking. These findings “can […] be taken as empirical evidence confirming previous intuitive conjectures according to which interlingual respeaking requires more cognitive effort than intralingual respeaking as it combines two complex tasks: respeaking and interpreting” (Chmiel et al. 2017: 1222). Finally, Szarkowska et al. (2018) focused on accuracy, to verify whether interpreters had a comparative advantage over translators and bilingual controls. Indeed, the interpreters obtained the highest accuracy scores and the lowest text reduction rates (measured via the NER model).

Turning to product-oriented research on IRSP, a reliable method for assessing the accuracy of interlingual live subtitles, as well as quality standards for this translation mode, has yet to be defined. Romero-Fresco and Pöchhacker (2017) developed the NTR model, which distinguishes between recognition errors and human translation errors.[3] Translation errors include both content-related errors (omissions, additions and substitutions) and form-related errors (grammatical correctness and style). The model acknowledges that some errors are more serious than others in terms of the effect they have on viewers, and distinguishes between minor, major and critical errors (-0.25, -0.50 or -1 point, respectively). Minor errors slightly alter the message but do not hamper comprehension; major errors introduce bigger changes, but the overall meaning of the text is preserved; critical errors result in grossly inaccurate and misleading information and affect comprehension significantly.
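As an illustration of how such a severity-weighted score works, the NTR accuracy rate can be sketched in a few lines of code. This is a minimal sketch, assuming the published formula NTR = (N - T - R) / N x 100, where N is the subtitle word count and T and R are the summed penalties for translation and recognition errors; the figures in the example are invented:

```python
# Sketch of the NTR accuracy score, assuming the formula
# NTR = (N - T - R) / N * 100 with the severity weights described
# above: minor 0.25, major 0.50, critical 1.0 points per error.

WEIGHTS = {"minor": 0.25, "major": 0.50, "critical": 1.0}

def ntr_score(n_words, translation_errors, recognition_errors):
    """Return the NTR accuracy percentage for a set of live subtitles.

    The two error arguments are lists of severity labels, one per error.
    """
    t = sum(WEIGHTS[e] for e in translation_errors)
    r = sum(WEIGHTS[e] for e in recognition_errors)
    return (n_words - t - r) / n_words * 100

# Invented example: a 500-word subtitle file with 4 minor and 2 major
# translation errors plus 3 minor recognition errors.
score = ntr_score(500, ["minor"] * 4 + ["major"] * 2, ["minor"] * 3)
print(round(score, 2))  # 99.45, above the 98 per cent threshold
```

Note that this weighting is the only computational part of the model; the real analytical work lies in classifying each error and its severity, which the NTR model leaves to a trained evaluator.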

The model needs to be validated in various real-life settings to determine whether a 98 per cent accuracy rate is a feasible quality benchmark in IRSP too. At the time of writing, a few studies have applied the NTR model to experimental data. The SMART (Shaping Multilingual Access with Respeaking Technology) pilot project investigated the training background and skill-set which can best support the fast acquisition of IRSP competences (Sandrelli et al. 2019). Between January and February 2018 an IRSP “crash course” of 6-8 hours was delivered face-to-face to 25 subtitling and interpreting trainees from three universities (UNINT, Surrey and Roehampton); the students had varying degrees of expertise in subtitling, consecutive and simultaneous interpreting, and intralingual respeaking. Two IRSP performances per student were collected at the end of the course; in addition, students carried out a self-reflective analysis via a retrospective TAP (think-aloud protocol) after each task and filled in pre- and post-experiment questionnaires. The NTR model was applied to all the performances, and the questionnaire and TAP data were analysed. The results of the study are reported in a dedicated publication (Davitti and Sandrelli forthcoming), but the key conclusions can be summarised here: a training background in interpreting seems to be an advantage, but it is not sufficient, as the best IRSP performers were those students with a composite skillset comprising interpreting/subtitling or interpreting/subtitling/respeaking. Moreover, there was a high degree of variability among subjects, which suggests that personal traits play a significant role in IRSP. This is not entirely surprising, if the “live skills” required to perform in real time (such as concentration and stress management) are taken into account.[4]

Dawson and Romero-Fresco (forthcoming) report on the results of a four-week pilot training course delivered within the ILSA (Interlingual Live Subtitling for Access) project, the first IRSP course ever developed. Fifty students with a training background in subtitling or interpreting participated in the course, which consisted of three weekly sessions and was delivered entirely online. After analysing their performances in the final tests, the authors concluded that IRSP is indeed feasible, with over 40 per cent of subjects hitting or exceeding the 98 per cent NTR mark after this relatively short course.[5] On average, the student interpreters performed better than the subtitlers, but some of the latter also did well, so an interpreting background does not seem to be mandatory.

The above studies seem to confirm that IRSP is indeed feasible, albeit challenging, and that a training background in a related discipline such as interpreting or subtitling may be an advantage in the acquisition of IRSP skills. However, as IRSP is still essentially an experimental practice, more empirical data from various settings and involving different language combinations are needed. One of the problems of carrying out empirical research on IRSP is that it is not (yet) a widespread method to produce live subtitles. However, over the past few years some MA dissertations have reported on small studies in specific settings in which an ad-hoc IRSP service had been organised: Marchionne (2010-11) described an experiment in which several TV programmes were subtitled live via IRSP in the French-Italian language combination; Serafini (2014-15) organised an IRSP-based live subtitling service (English-Italian) during a film festival in Italy. Case studies of this kind do not allow for generalisations, but as best practices have yet to be defined, they are useful to test different IRSP set-ups. The present paper contributes to the on-going discussion by providing some empirical data on IRSP in a conference setting.

3. Data and methodology

The data used in this study come from the 5th International Symposium on Live Subtitling, respeaking and accessibility which took place at UNINT on 12 June 2015. As both simultaneous interpreting and (intralingual) respeaking are taught at our university, the Local Organising Committee decided to provide both live subtitles via respeaking and a simultaneous interpreting service in the two official languages of the Symposium, English and Italian.

3.1 The Symposium

The Symposium was a one-day event, with 20 speakers who included academics, software developers and users of live subtitling services (including representatives from deaf associations) from all over Europe, the US and Australia. The event was made up of a morning session on research issues (with an opening section and two thematic panels) and an afternoon session on practical developments (with two thematic panels, a round table and a closing section). The thematic panels included conference papers and moderators’ introductions, floor allocations, and announcements; some moderators also acted as discussants and, as well as introducing presenters and ensuring time-keeping, encouraged debate by asking questions. Italian was used in the opening and closing sections and in the final round table, while all the research papers were delivered in English, although only two conference presenters were native speakers.

The Italian speeches were simultaneously interpreted into English and subtitled in Italian via intralingual respeaking; the English speeches were interpreted and subtitled into Italian via IRSP. Simultaneous interpreting was provided by five volunteers, all of them interpreting graduates of our university and native speakers of Italian: one of them had about 3 years’ experience, another about 2 years, and the other three had graduated 3 months before the Symposium. The respeaking service was provided by one of our media partners, the onA.I.R. international respeaking association; it involved four respeakers, all of them Italian native speakers and relatively experienced in intralingual respeaking, but not in IRSP, as in Italy there is not much demand for it yet. The most experienced respeaker was also a trained simultaneous interpreter. All the respeakers used Dragon NaturallySpeaking (v. 12) on their laptops.[6]

The event took place in the university conference hall, an auditorium on two levels equipped with a sound system, a large screen for slide projection and 4 sound-proof interpreting booths on the upper floor; two booths were used by the interpreters and two by the respeakers. As the booths are at a considerable distance from the rostrum and the screen, interpreters and respeakers could not see the screen very clearly; therefore, the organisers made sure they received the speakers’ presentations ahead of time. All the advance material (presentations, programme, abstracts, speakers’ biographical information) enabled interpreters and respeakers to familiarise themselves with the topics and prepare their glossaries; the respeakers were also able to train the software by adding new words to the Vocabulary Editor and by creating macros.[7] Figure 1 shows an interpreter’s booth, with the big screen above the speakers’ table just visible in the background: here the interpreter is using her laptop with the PowerPoint presentations and her glossaries, as well as paper copies of materials.

Figure 1: An interpreter’s workstation

All the conference speakers used the same PowerPoint template for their presentations, in which some space was left blank at the bottom of each slide to accommodate a maximum of three lines of subtitles. The subtitles were beamed directly onto the slides by means of the Text-on-Top software. The end result is shown in Figure 2.

Figure 2: A speaker’s slides with 3 lines of subtitle

The above set-up was chosen because the conference hall of our university does not feature two side-by-side screens, and space constraints make it impossible to place an additional screen below or above the main screen, which is common practice in film festivals or theatres.

The entire Symposium was video-recorded with a camera fixed on the speakers’ table; in addition, the interpreters and respeakers recorded themselves by means of digital audio-recorders.

3.2 The multimedia archive and the data for the empirical study

The first step in the creation of the multimedia archive was to edit the video and audio files collected during the Symposium: each SL speech was selected and saved as an individual file, and matching TL audio files were created for the interpreters’ output and the respeakers’ output. All the video and audio clips thus obtained were transcribed orthographically, following the conventions established in the European Parliament Interpreting Corpus (EPIC) project (Monti et al. 2005).[8] A few specific annotations were added for phenomena that occurred during respeaking, such as those cases in which the respeakers entered corrections manually by using the keyboard, or when an SL speaker interrupted the presentation to show a video clip.

It is important to note that the respoken text is made up of all the words the respeakers uttered, including the voice commands for punctuation, macros, and so on; this “intermediary text” (Pöchhacker and Remael 2019) is not meant for the audience but for the speech recognition software, which then processes it to produce the TL subtitles. The respeaker checks the output of the software (and sometimes edits it) before projecting it as subtitles for the benefit of the audience. As the aim of our study was to compare the TL “end products” that reached the audience (the interpreted speeches and the TL subtitles), it was necessary to add a further element to the ELAN transcription layers of the four speeches in question, namely the TL subtitles themselves.[9]

All the transcripts were manually aligned with their corresponding video and audio files by using the ELAN software programme (v. 4.9.4). For each speech, the multimedia file thus obtained includes:

  • a video-recording of the original SL speech;
  • the transcript of the SL speech;
  • the audio recording of the simultaneous interpreter’s TL speech;
  • the transcript of the simultaneous interpreter’s TL speech;
  • the audio recording of the respoken text (the intermediary text dictated to Dragon);
  • the transcript of the respoken text;
  • the text of the TL subtitles displayed to the audience on the screen.

The multimedia file makes it possible to play the original video clip while displaying the various transcription layers, thus providing a visual representation of the time lag between the SL speaker, the interpreter and the respeaker, as can be seen in Figure 3.

Figure 3: An example of an ELAN multimedia transcript

After preparing the multimedia files of the entire conference, four speeches were selected for our pilot analysis, which was focused exclusively on the English into Italian translation direction. The two speech types that were selected were the “moderator’s introduction” and the “scientific conference paper”. The second selection parameter was the degree of scriptedness, i.e. whether the speech was impromptu or read from the slides or written notes. The resulting sub-corpus consisted of four speeches:

  • two introductions: M1 was a short impromptu Q&A session, consisting of an introduction by the moderator, a question asked by an audience member, and the answers given by two panel speakers; M2 was an introduction in which the moderator read out the speaker’s biographical note;
  • two conference papers: P1 was delivered by a researcher who illustrated his slides speaking impromptu; P2 was delivered by another researcher who read the slides aloud.

Both conference papers had a duration of about 18 minutes, while the introductions lasted about three and a half and two and a half minutes respectively. M1 (the impromptu introduction) was much faster than the other three speeches. In the Interpreting Studies literature a speed of around 100-120 w/m is considered “comfortable” for interpreting purposes (Pöchhacker 2004: 129); in respeaking, the addition of oral punctuation and the need to monitor the subtitles increase the time lag between the original speaker and the respeaker (see §1). In this respect, P1 and M2 were manageable, P2 was a bit fast and M1 was definitely challenging.

SL speeches   duration   length (SL words)   speed (w/m)
P1            17'55''    1,844               103
P2            18'09''    2,047               113
M1            3'24''     574                 169
M2            2'29''     225                 91
TOTAL         41'57''    4,620               110

Table 1: The four SL speeches
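The speeds in Table 1 follow directly from the word counts and durations (words divided by minutes); they can be reproduced with a few lines of code, using the figures from the table:

```python
# Speech rate in words per minute: word count / duration in minutes.
# Word counts and durations are those reported in Table 1.

def wpm(words, minutes, seconds):
    """Return the speech rate rounded to the nearest word per minute."""
    return round(words / (minutes + seconds / 60))

speeches = {
    "P1": (1844, 17, 55),  # 17'55''
    "P2": (2047, 18, 9),   # 18'09''
    "M1": (574, 3, 24),    # 3'24''
    "M2": (225, 2, 29),    # 2'29''
}
for name, (words, m, s) in speeches.items():
    print(name, wpm(words, m, s))  # P1 103, P2 113, M1 169, M2 91
```

The same calculation confirms the overall figure: 4,620 words in 41'57'' give about 110 w/m.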

The four SL speeches were interpreted by two recent graduates, while the subtitles were produced by the most experienced respeaker (who was also a trained interpreter).

3.3 The analysis grid

Each of the four SL speeches was subdivided into idea units on the basis of semantic content (with grammatical and prosodic features helping to determine the boundaries); then, matching idea units were identified (if present) in the interpreted version, in the respoken version and in the TL subtitles. In order to carry out the analysis, a taxonomy that could be applied to all of our TL data had to be developed, with categories describing the relationship between each SL idea unit and corresponding TL idea unit in terms of the information made available to the TL audience. To this end, a review of relevant literature on quality in simultaneous interpreting (including Barik 1971, Altman 1994, Falbo 2002), subtitling (Gottlieb 1992, Díaz Cintas and Remael 2007) and intralingual respeaking (Romero-Fresco 2011) was carried out.

Our taxonomy tries to combine all of the above classifications in three macro-categories, namely semantic transmission, reduction and distortion. Transmission refers to those instances in which the semantic content of the SL idea unit was successfully conveyed by the TL idea unit; reduction refers to cases in which some information is missing in the TL message, which only expresses part of the SL content; and distortion refers to factual alteration of semantic content, i.e. the TL unit expresses a different idea. Each macro-category includes a few sub-categories, as can be seen in Table 2; definitions and examples are provided in §4.

Semantic Transmission   Semantic Reduction   Semantic Distortion
Transfer (T)            Decimation (DEC)     Substitution (S)
Condensation (C)        Omission (O)         Generalisation (G)
Explicitation (E)                            Addition (A)
Deletion (DEL)

Table 2: Analysis grid

It is important to note that the labels used here refer to the comparison of SL and TL versions and do not imply the use of deliberate strategies on the part of interpreters and respeakers, as that would amount to speculating about their intentions. Moreover, as in all taxonomies, there is a degree of subjectivity in the application of the categories, and also a degree of overlap between them, which means that it is necessary to select the category that seems to be prevalent in each TL unit (again, a subjective choice). In order to ensure some objectivity, a peer-review system was adopted, in which the two evaluators (the present author and the MA student who was writing her dissertation on this topic) coded the texts separately, and then discussed and resolved points of disagreement.

4. Analysis

Table 3 shows the number of words and the number of idea units identified in each SL speech, accompanied by the corresponding figures for the interpreted speeches and for the TL subtitles produced via IRSP (henceforth, subtitles).

 

        SL      SL idea   TL words        TL idea units   TL words      TL idea units
        words   units     (interpreter)   (interpreter)   (subtitles)   (subtitles)
P1      1,844   194       1,392           138             982           120
P2      2,047   237       1,683           201             1,109         133
M1      574     55        256             31              215           25
M2      225     27        152             20              129           17
TOTAL   4,620   513       3,482           390             2,435         295

Table 3: Number of words and idea units in the SL and TL texts

In all the TL versions (the interpreters’ speeches and the subtitles) the number of words is lower than in the SL speeches; the same applies to the number of idea units, which is lower in the interpreted versions and drops even further in the subtitles. However, quantitative data of this kind are not enough to determine whether the SL message was reproduced accurately, as interpreters and respeakers are trained to translate succinctly in order to cope with the constraints of real-time translation. The analysis that follows investigates the issue in more depth.
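The scale of the reduction can nonetheless be quantified from the totals in Table 3, expressing each TL count as a percentage of the corresponding SL count:

```python
# Retention rates computed from the totals row of Table 3:
# the TL count as a percentage of the corresponding SL count.

def retention(tl_count, sl_count):
    """Percentage of the SL count preserved in the TL version (1 decimal)."""
    return round(tl_count / sl_count * 100, 1)

# Interpreted speeches vs. subtitles, whole sub-corpus.
print(retention(3482, 4620), retention(2435, 4620))  # words: 75.4 52.7
print(retention(390, 513), retention(295, 513))      # idea units: 76.0 57.5
```

In other words, the subtitles retain just over half of the SL words and under 60 per cent of the idea units, against roughly three quarters for the interpreted speeches; the qualitative analysis that follows establishes how much of that reduction actually entails loss of meaning.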

The analysis grid presented in §3.3 (Table 2) was applied to both sets of data. Before presenting quantitative results, let us illustrate the classification with definitions and examples. The macro-category of semantic transmission covers all those cases in which the SL idea units were successfully conveyed to the TL audience; it includes the four sub-categories of transfer, explicitation, condensation and deletion. When the content of the SL unit is fully expressed by the TL unit, this is classified as a transfer (Example 1).

SL Speaker:    ok (ehm) so we don’t have a lot of time
Interpreter:   non abbiamo molto tempo [we don’t have much time] (T)
Respeaker:     non abbiamo molto tempo [we don’t have much time]
TL Subtitles:  Non abbiamo molto tempo [we don’t have much time] (T)

Example 1: Transfer (T)

When the TL idea unit manages to convey the same and complete meaning as the original more concisely, there is a condensation (C). In Example 2, the SL speaker announces he is going to play a short video for a demo, and both the interpreted version and the TL subtitles opt for more succinct structures to introduce the clip (Interpreter: here is a demo; Subtitle: this is a demo.).

SL Speaker:    I can show you a small demonstration
Interpreter:   ecco qui una dimostrazione [here is a demo] (C)
Respeaker:     questa è una dimostrazione punto [this is a demo full stop]
TL Subtitles:  Questa è una dimostrazione. [this is a demo.] (C)

Example 2: Condensation (C)

If the TL idea unit makes the idea more explicit than it was in the SL, this is classified as an explicitation (E). In Example 3 the speaker is commenting on the video clip of a speech recognition software at work. The interpreter makes the meaning more explicit by adding a definition of the technical term latenza (latency), namely the speech recognition delay. By contrast, in the TL subtitles there is a deletion (DEL) of a redundant item, i.e. the reference to the speech recognition software of which the audience is watching a live demo. Deletions consist in the removal of redundant items, such as repetitions or meta-comments (e.g. “as I have already said”), from the TL unit; the disappearance of such items does not affect the semantic content of the TL unit. As both interpreters and respeakers have to wrestle with time constraints, it is important for them to be able to identify redundant material that can be dispensed with.

SL Speaker:    (ehm) you cannot see also the latency of the speech recogniser
Interpreter:   non potete neanche notare la latenza il ritardo del riconoscimento del parlato [you cannot even notice the latency the speech recognition delay] (E)
Respeaker:     non riuscite a vedere la latenza (...) virgola [you cannot see the latency comma]
TL Subtitles:  Non riuscite a vedere la latenza, [you cannot see the latency,] (DEL)

Example 3: Explicitation (E) and deletion (DEL)

Turning to semantic reduction, this macro-category covers those instances in which the TL idea unit conveys less information than the original: it includes two types, decimation and omission. Unlike deletions, therefore, decimations and omissions do affect the meaning of the TL unit. A decimation (DEC) is a partial loss of information, i.e. the TL subtitles or the TL interpreted version manage to convey part of the message, but some details are missing. In Example 4 the SL speaker is talking about the influence of certain factors on the performance of the speech recognition software: “topic” as a factor is not present in either translation, and the TL users have no way to recover or infer this detail.

| SL Speaker | Interpreter | Cat. | Respeaker | TL Subtitles | Cat. |
|---|---|---|---|---|---|
| but it depends on the topic discussed and the speakers | ma questo dipende da molti fattori anche dal- l'oratore stesso [but this depends on many factors, including the speaker himself] | DEC | a seconda dei fattori virgola come per esempio l'oratore stesso punto [depending on factors comma such as for example the speaker himself full stop] | a seconda dei fattori, come per esempio l'oratore stesso. [depending on factors, such as for example the speaker himself] | DEC |

Example 4: Decimation (DEC)

Unlike a decimation, which conveys at least part of the semantic content, an omission (O) occurs when a given SL idea unit has no matching TL idea unit at all. In Example 5 the speaker is describing the creation of a user’s voice model: both the interpreter and the respeaker skipped the whole idea unit and the information was completely lost in the TL versions.

| SL Speaker | Interpreter | Cat. | Respeaker | TL Subtitles | Cat. |
|---|---|---|---|---|---|
| it’s very fast very cheap |  | O |  |  | O |

Example 5: Omission (O)

Finally, the macro-category of semantic distortion refers to the alteration of semantic content. The most obvious example is substitution (S), which replaces an SL idea with a completely different idea in the TL unit. Example 6 shows that both the interpreted version and the TL subtitles replaced 100% with 20%, which results in the audience receiving factually wrong information.[10]

| SL Speaker | Interpreter | Cat. | Respeaker | TL Subtitles | Cat. |
|---|---|---|---|---|---|
| the speech recognition accuracy is almost one hundred per cent | la quindi: l'accuratezza gli errori di accurazio- di accuratezza sono circa del venti per cento [the therefore accura- the errors of accuracy are about twenty per cent] | S | l'accuratezza (…) può essere bassa virgola con un tasso di errore del venti per cento punto [accuracy can be low comma with an error rate of twenty per cent] | l'accuratezza può essere bassa, con un tasso di errore del 20%. [accuracy can be low, with an error rate of 20%] | S |

Example 6: Substitution (S)

A generalisation (G) is the inappropriate use of a hypernym or of a more general phrase than the original SL formulation, which results in the TL unit conveying a different idea. In Example 7 the speaker is talking about the difficulty of devising an automatic punctuation tool and mentions commas as a problem in his language. The TL subtitle mentions “punctuation marks” in general, and the resulting sentence implies that in Czech there are more types of punctuation marks than in other languages.

| SL Speaker | Interpreter | Cat. | Respeaker | TL Subtitles | Cat. |
|---|---|---|---|---|---|
| because in Cz- in the Czech lin- language we have many commas in the in the sentences | quindi soprattutto nella nostra lingua abbiamo molte virgole nelle frasi [so especially in our language we have many commas in sentences] | T | specialmente nella nostra lingua virgola (…) in cui abbiamo molti segni di punteggiatura punto [especially in our language comma in which we have many punctuation marks full stop] | specialmente nella nostra lingua, in cui abbiamo molti segni di punteggiatura. [especially in our language, in which we have many punctuation marks.] | G |

Example 7: Generalisation (G)

The last type of semantic distortion is an addition (A), which introduces extraneous elements in the TL unit (unlike explicitation, which clarifies the meaning of the original). In Example 8 the TL subtitles convey the idea that the respeaker may resort to the keyboard to add words to the vocabulary, but the phrase come meglio crede (as he/she sees fit) adds the unwarranted nuance of “as (s)he likes best”.

| SL Speaker | Interpreter | Cat. | Respeaker | TL Subtitles | Cat. |
|---|---|---|---|---|---|
| so the respeaker can add these words to the system just during subtitling | e quindi il respeaker può effettivamente aggiungere delle nuove parole durante la sottotitolazione [and therefore the respeaker can actually add new words during subtitling] | T | il respeaker (…) può quindi <interridurre> [scrive] le parole come meglio crede tramite la tastiera punto [the respeaker can therefore interriduce - introduce [types] the words as he sees fit via the keyboard full stop] | Il respeaker può quindi introdurre le parole come meglio crede tramite la tastiera. [the respeaker can therefore introduce the words as he sees fit via the keyboard] | A |

Example 8: Addition (A)

Having defined and illustrated each category, let us now look at their distribution in the TL versions of the four speeches (Table 4).

 

|           | Semantic Transmission |    |    |     | Semantic Reduction |     | Semantic Distortion |    |    |
|           | T   | E  | C  | DEL | O   | DEC | S   | G  | A  |
| P1-Int    | 44  | 8  | 23 | 7   | 49  | 21  | 23  | 12 | 7  |
| P2-Int    | 142 | 4  | 15 | 14  | 22  | 15  | 17  | 6  | 2  |
| M1-Int    | 11  | 1  | 11 | 2   | 22  | 0   | 7   | 0  | 1  |
| M2-Int    | 11  | 4  | 0  | 4   | 3   | 4   | 0   | 1  | 0  |
| TOTAL Int | 208 | 17 | 49 | 27  | 96  | 40  | 47  | 19 | 10 |
| P1-Sub    | 24  | 14 | 17 | 15  | 59  | 23  | 15  | 22 | 5  |
| P2-Sub    | 48  | 14 | 28 | 29  | 75  | 12  | 21  | 6  | 4  |
| M1-Sub    | 5   | 0  | 7  | 9   | 21  | 1   | 7   | 2  | 3  |
| M2-Sub    | 7   | 3  | 0  | 5   | 5   | 5   | 1   | 1  | 0  |
| TOTAL Sub | 84  | 31 | 52 | 58  | 160 | 41  | 44  | 31 | 12 |

Table 4. Key: T= transfer; E= explicitation; C= condensation;
DEL= deletion; O= omission; DEC= decimation;
S= substitution; G= generalisation; A= addition

The data in Table 4 show that the number of transfers is much higher in the interpreted output of all four speeches, which means that, overall, the interpreters conveyed the semantic content of the original speeches more fully than the respeakers. Condensation was the second most frequent semantic transmission category in both translation modes, ranking at very similar levels (49 in the interpreted output vs. 52 in the subtitles), but with a different distribution across the four speeches. Predictably, explicitation was less frequent in both translation modes; however, in 31 cases the respeakers thought it necessary to make the subtitles more explicit than the SL idea unit, despite the time and space constraints.

Deletions (DEL) of redundant or implicit elements occurred twice as often in the subtitles as in SI, but omissions (O) were also very frequent in the subtitles. It would seem that one of the ways in which respeakers try to cope with the complex cognitive demands of IRSP is by cutting out parts of the SL message; while sometimes this affects only redundant elements and the overall meaning is preserved (deletions), more often than not there is some information loss (omissions). Cuts were especially marked in the subtitles of the read academic paper (P2), which had very high information density, and in the translation of M1, which was delivered at a high speed (see Table 1).

Finally, turning to semantic distortions, substitution was the most frequent type in both SI and IRSP; it is worth noting that there were more generalisations in the subtitles than in the interpreted speeches.

Figure 4 shows all of the above results in a graph and grouped by the three macro-categories, namely semantic transmission, reduction and distortion; the results related to the interpreted version and the subtitles of the same speech are placed next to each other for ease of comparison.

Figure 4: Overall results for semantic transmission,
reduction and distortion across all the TL data

Once again it can be observed that the interpreted speeches reproduced the semantic content of the four SL speeches more accurately than the subtitles: this can be seen in the higher proportion of semantic transmissions and in the lower number of semantic reductions. Moreover, while semantic transmission was the most frequent macro-category in the interpreted speeches, this was not the case in all the subtitled versions: for example, the subtitles of the impromptu conference paper (P1-Sub) featured more reduction than transmission (which equates to content loss); in addition, in the subtitled versions of the two moderators’ introductions (M1-Sub and M2-Sub) the figures for semantic transmission and reduction were similar. However, it is interesting to note that the difference between the interpreted speeches and the subtitles was not so marked in terms of factual errors; exactly the same number of semantic distortions was found in the interpreted and subtitled versions of P1 (42), while the subtitles for P2, M1 and M2 contained only a few more errors than the corresponding interpreted versions.

In relation to text type, the best results were obtained on the read conference paper (P2): in the interpreted version 74 per cent of all the SL units (175 out of 237) were conveyed to the TL audience via semantic transmission, while in the subtitles the percentage dropped to 50 per cent (119). By contrast, P1 (the impromptu conference paper) turned out to be more challenging for both interpreters and respeakers: only 42 per cent of SL units (82 out of 194) were conveyed via semantic transmission in the interpreted version and only 36 per cent (70) in the subtitles. Interpreters and respeakers produced the same number of semantic distortions (42), and semantic reduction was quite marked in both versions (82 vs. 70 in P1-Sub and P1-Int, respectively). Finally, although the short duration of the moderators’ introductions (M1 and M2) makes it difficult to draw reliable conclusions, the impromptu introduction (M1) proved more challenging than the read introduction: both semantic reductions and distortions were far more abundant in both the interpreted and subtitled versions of M1 than in M2. In this case, speed may have played a role too, as this was the fastest speech in the sub-corpus (see Table 1).
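The percentages quoted above can be reproduced directly from the raw counts in Table 4. The short sketch below, added purely for illustration, groups the nine categories into the three macro-categories (following the key to Table 4) and recomputes the figures for P1 and P2:

```python
# Raw counts from Table 4, in the key's order:
# T, E, C, DEL (transmission), O, DEC (reduction), S, G, A (distortion).
counts = {
    "P1-Int": [44, 8, 23, 7, 49, 21, 23, 12, 7],
    "P1-Sub": [24, 14, 17, 15, 59, 23, 15, 22, 5],
    "P2-Int": [142, 4, 15, 14, 22, 15, 17, 6, 2],
    "P2-Sub": [48, 14, 28, 29, 75, 12, 21, 6, 4],
}
units = {"P1": 194, "P2": 237}  # total SL idea units per speech

def macro(row):
    """Collapse the nine categories into the three macro-categories."""
    return sum(row[0:4]), sum(row[4:6]), sum(row[6:9])

for label, row in counts.items():
    transmission, reduction, distortion = macro(row)
    pct = round(transmission / units[label[:2]] * 100)
    print(f"{label}: {pct}% transmitted, "
          f"{reduction} reductions, {distortion} distortions")
# P1-Int: 42% transmitted, 70 reductions, 42 distortions
# P1-Sub: 36% transmitted, 82 reductions, 42 distortions
# P2-Int: 74% transmitted, 37 reductions, 25 distortions
# P2-Sub: 50% transmitted, 87 reductions, 31 distortions
```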

5. Discussion and conclusions

As was explained in §3.2, the first tangible result of this study has been the creation of a multimedia archive that includes the parallel data of SL speeches delivered during the Symposium and the TL data produced via SI and IRSP.

Firstly, the importance of prior preparation has been confirmed by our results, which show that both interpreters and respeakers performed better on those speeches they had had the opportunity to prepare in advance (the “read” speeches P2 and M2). This is an aspect that should be stressed in future IRSP courses and highlighted in any best practice guidelines aimed at event organisers, who must be made aware of respeaker needs; as both IRSP and interpreting are cognitively very demanding, thorough preparation is essential.

As regards the application of our categories to the data, the frequency of semantic distortions (i.e. actual translation errors) was roughly comparable in both translation modes, with substitutions emerging as the main problem and generalisations affecting the subtitles more than the interpreted output. The real difference between SI and IRSP in this setting was actually the quantity of information conveyed to the TL audience: thanks to the higher number of semantic transmissions (especially transfers) and the lower number of semantic reductions (especially omissions), the interpreted speeches rendered the SL content more fully.

Of course, our study does not attempt to explain why there were so many omissions in the TL subtitles. Some SL idea units may have been omitted because the respeaker was lagging behind too much; references to visual information in the SL slides may have been missed owing to the concurrent need to monitor the recognition output; or perhaps this trend only affects this small subset of speeches and not the whole Symposium. At the time of writing, it is impossible to say whether a more marked semantic reduction is an intrinsic feature of IRSP vis-à-vis simultaneous interpreting in live conferences; many factors are likely to play a role, including the type of live event, degree of prior preparation, the skills of the professionals involved, the technical set-up and so on. The aim of this contrastive analysis study was not to advocate for one translation mode over the other, but to identify patterns in the available data to inform future IRSP teaching and suggest potential research avenues. Indeed, another interesting result is that there were twice as many deletions in the TL subtitles as in the interpreted speeches, which means that the respeakers were relatively successful at identifying redundant items that could be eliminated. The ability to edit the text successfully is even more important in IRSP than in SI, as the extra effort required to add punctuation and produce written sentences via speech recognition increases the time lag. Therefore, trainee respeakers must learn to carry out real-time text analysis in order to identify any items that can be deleted without affecting the meaning.

Another aspect that certainly requires further investigation is the various technological set-ups for subtitle projection. In our Symposium the subtitles were projected onto the bottom of each PowerPoint slide (see §3.1), but in other settings it might be possible to use an additional screen or provide subtitles directly to users’ personal devices; the influence of different configurations on the IRSP service has not been studied yet. A related issue is the audience reception of live subtitles in a conference setting, namely how the use of different screens (position, size, and so on) can affect audience comprehension; research in related fields, such as opera surtitling, may provide useful hints. Moreover, the ergonomics of respeakers’ workstations also needs to be investigated. While a direct view of speakers and audience is considered a pre-requisite for a simultaneous interpreting service of good quality (AIIC 2011), in the case of respeaking it might be worth experimenting with an in-booth monitor displaying speakers and slides, to investigate whether such a set-up would make it easier for the respeaker to switch from the monitor to the laptop where the speech recognition software is installed. In short, there is a need for more empirical and experimental data to compare different technical configurations and their influence on the delivery of IRSP in conferences and, in due course, to produce a set of best practice guidelines. It is hoped that the present study may be a useful step in this direction.

References

AIIC (2011) “Simultaneous Interpretation Equipment”, aiic.net, 28 November 2011, URL: http://aiic.net/p/4030 (accessed 8 June 2020).

Altman, Jane (1994) “Error Analysis in the Teaching of Simultaneous Interpreting: a Pilot Study”, in Bridging the Gap, Empirical Research in Simultaneous Interpreting, Sylvie Lambert and Barbara Moser-Mercer (eds), Amsterdam/Philadelphia, John Benjamins Publishing Company: 25-38.

Barik, Henri C. (1971) “A Description of Various Types of Omissions, Additions and Errors of Translation Encountered in Simultaneous Interpretation”, Meta 16, no. 4: 199-210. URL: https://www.erudit.org/fr/revues/meta/1971-v16-n4-meta254/001972ar/ (accessed 8 June 2020).

Chmiel, Agnieszka, Agnieszka Szarkowska, Daniel Koržinek, Agnieszka Lijewska, Łukasz Dutka, Łukasz Brocki and Krzysztof Marasek (2017) “Ear-Voice Span and Pauses in Intra- and Interlingual Respeaking: An Exploratory Study into Temporal Aspects of the Respeaking Process”, Applied Psycholinguistics 38, no. 5: 1201-1227.

Davitti, Elena and Annalisa Sandrelli (forthcoming) “Embracing the Complexity: a Pilot Study on Interlingual Respeaking”, JAT- Journal of Audiovisual Translation.

Dawson, Hayley and Pablo Romero-Fresco (forthcoming) “Towards Research-Informed Training in Interlingual Respeaking: an Empirical Approach”, The Interpreter and Translator Trainer.

Díaz Cintas, Jorge and Aline Remael (2007) Audiovisual Translation. Subtitling, Manchester, St. Jerome.

Eugeni, Carlo (2008) “A Sociolinguistic Approach to Real-Time Subtitling: Respeaking vs. Shadowing and Simultaneous Interpreting”, in English in International Deaf Communication, Linguistic Insights, Cynthia Jane Kellett Bidoli and Elana Ochse (eds), Bern, Peter Lang: 357-382.

Eugeni, Carlo (2009) “Respeaking the BBC News. A Strategic Analysis of Respeaking on the BBC”, The Sign Language Translator and Interpreter 3, no. 1: 29-68.

Falbo, Caterina (2002) “Error Identification and Classification: Instruments for Analysis”, in Interpreting in the 21st Century: Challenges and Opportunities. Selected papers from the first Forlì Conference on Interpreting Studies, 9-11 November 2000, Giuliana Garzone, Peter Mead and Maurizio Viezzi (eds), Bologna, CLUEB: 111-128.

Gottlieb, Henrik (1992) “Subtitling. A New University Discipline”, in Teaching Translation and Interpreting. Training, Talent and Experience. Papers from the First Language International Conference, Elsinore, Denmark, 31 May-2 June 1991, Cay Dollerup and Anne Loddegaard (eds), Amsterdam/Philadelphia, John Benjamins Publishing Company: 161-170.

Luppino, Denise (2016-17) Interpretazione simultanea e respeaking interlinguistico: due modalità di traduzione a confronto. Un contributo alla ricerca, MA diss., UNINT, Italy.

Marchionne, Francesca (2010-11) Il respeaking interlinguistico. Sperimentazioni per una evoluzione della sottotitolazione in diretta, MA diss., University of Macerata, Italy.

Marsh, Alison (2004) Simultaneous Interpreting and Respeaking: a Comparison, MA diss., University of Westminster, UK.

Monti, Cristina, Claudio Bendazzoli, Annalisa Sandrelli and Mariachiara Russo (2005) “Studying Directionality in Simultaneous Interpreting through an Electronic Corpus: EPIC (European Parliament Interpreting Corpus)”, Meta 50, no. 4, December 2005. URL: https://www.erudit.org/fr/revues/meta/2005-v50-n4-meta1024/019850ar.pdf (accessed 8 June 2020).

Moores, Zoe (2017) “Respeaking Profiles: What Makes a ‘Good’ Respeaker? Does a Particular Profile Exist?”, Paper presented at the International Conference on Audiovisual Translation, Poznań, 25-26 September 2017.

Moores, Zoe (2018) “Respeaking at Live Events: Ensuring Quality in Diverse Settings”, Paper presented at the 6th International Symposium on Accessibility and Live Subtitling, Milan, 14 September 2018.

Moores, Zoe (2020) “Fostering Access for All through Respeaking at Live Events”, The Journal of Specialised Translation (JoSTrans) 33: 207-226. URL: https://jostrans.org/issue33/art_moores.php (accessed 8 June 2020).

Ofcom (2015a) Ofcom’s Code on Television Access Services, 13 May 2015, URL: https://www.ofcom.org.uk/__data/assets/pdf_file/0016/40273/tv-access-services-2015.pdf (accessed 8 June 2020).

Ofcom (2015b) Measuring Live Subtitling Quality. Results from the Fourth Sampling Exercise, 27 November 2015, URL: https://www.ofcom.org.uk/__data/assets/pdf_file/0011/41114/qos_4th_report.pdf (accessed 8 June 2020).

Pöchhacker, Franz (2004) Introducing Interpreting Studies, London/New York, Routledge.

Pöchhacker, Franz and Aline Remael (2019) “New Efforts? A Competence-Oriented Task Analysis of Interlingual Live Subtitling”, Linguistica Antverpiensia no. 18: 130-143. URL: https://lans-tts.uantwerpen.be/index.php/LANS-TTS/article/view/515/471 (accessed 8 June 2020).

Remael, Aline and Isabelle Robert (2018) “Live Subtitlers. Who are they?”, Paper presented at the 6th International Symposium on Accessibility and Live Subtitling, Milan, 14 September 2018.

Romero-Fresco, Pablo (2009) “More Haste Less Speed: Edited vs. Verbatim Respeaking”, Vigo International Journal of Applied Linguistics (VIAL) Vol. 6: 109-133. URL: http://vialjournal.webs.uvigo.es/pdf/Vial-2009-Article6.pdf (accessed 8 June 2020).

Romero-Fresco, Pablo (2011) Subtitling through Speech Recognition: Respeaking, Manchester, St Jerome Publishing.

Romero-Fresco, Pablo and Franz Pöchhacker (2017) “Quality Assessment in Interlingual Live Subtitling: The NTR Model”, Linguistica Antverpiensia New Series: Themes in Translation Studies 16: 149-167. URL: https://lans-tts.uantwerpen.be/index.php/LANS-TTS/article/view/438 (accessed 8 June 2020).

Russello, Claudio (2008-09) Respeaking e interpretazione simultanea: un’analisi comparata e un contributo sperimentale, MA diss., LUSPIO-Rome.

Sandrelli, Annalisa (2013) “Reduction Strategies and Accuracy Rate in Live Subtitling of Weather Forecasts: a Case Study”, Paper presented at the 4th International Symposium on Live Subtitling: Live Subtitling with Respeaking and Other Respeaking Applications, Barcelona, 12 March 2013.

Sandrelli, Annalisa, Elena Davitti and Pablo Romero-Fresco (2019) “Triangulating Quantitative and Qualitative Data Across Different Subject Groups”, Paper presented at the International Conference Media for All 8- Complex Understandings, Stockholm University, 17-19 June 2019.

Serafini, Gloria (2014-2015) La sottotitolazione interlinguistica in tempo reale: il caso Ericsson Olanda e l’esperimento presso il festival Sedicicorto 2014, MA diss., University of Bologna.

Szarkowska, Agnieszka, Krzysztof Krejtz, Łukasz Dutka and Olga Pilipczuk (2016) “Cognitive Load in Intralingual and Interlingual Respeaking – a Preliminary Study”, Poznań Studies in Contemporary Linguistics 52, no. 2: 209-233.

Szarkowska, Agnieszka, Krzysztof Krejtz, Łukasz Dutka and Olga Pilipczuk (2018) “Are Interpreters Better Respeakers?”, The Interpreter and Translator Trainer 12, no. 2: 207-226.

Szarkowska, Agnieszka, Łukasz Dutka, Olga Pilipczuk and Krzysztof Krejtz (2017) “Respeaking Crisis Points. An Exploratory Study into Critical Moments in the Respeaking Process”, in Audiovisual Translation – Research and Use, Mikolaj Deckert (ed.) Frankfurt am Main/Bern/ Bruxelles/New York/Oxford/Warszawa/Wien, Peter Lang: 179- 201.

Van Waes, Luuk, Mariëlle Leijten and Aline Remael (2013) “Live Subtitling with Speech Recognition. Causes and Consequences of Text Reduction”, Across Languages and Cultures 14, no. 1: 15-46.

Notes

[1] See Romero-Fresco (2011) for an overview of respeaking practices in various countries.

[2] N stands for the overall number of words in the subtitles; E stands for “edition errors”, namely the errors made by the respeaker; and R indicates “recognition errors”, i.e. the errors made by the software when converting spoken data into written text.

[3] N stands for the overall number of words in the subtitles, T for “translation errors” (made by the respeaker) and R indicates “recognition errors” (made by the software).
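In both models the accuracy rate is obtained in the same way: the two error counts are deducted from N and the result is expressed as a percentage, i.e. (N − E − R) / N × 100 for NER and (N − T − R) / N × 100 for NTR, with 98 per cent conventionally taken as the minimum acceptable score (Romero-Fresco 2011; Romero-Fresco and Pöchhacker 2017). A minimal sketch, with invented figures for illustration:

```python
def ner_accuracy(n, e, r):
    """NER model: n = words in the subtitles, e = edition errors
    (made by the respeaker), r = recognition errors (made by the software)."""
    return (n - e - r) / n * 100

def ntr_accuracy(n, t, r):
    """NTR model: the interlingual variant, with t = translation errors
    replacing edition errors."""
    return (n - t - r) / n * 100

# Invented example: 1000 subtitle words, 5 edition and 10 recognition errors.
print(ner_accuracy(1000, 5, 10))  # 98.5 -> above the customary 98% threshold
```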

[4] The results of the pilot study will be used as the basis for a full-scale SMART project (funded by the UK’s Economic and Social Research Council and starting in July 2020) involving professionals, rather than students.

[5] As students worked on the materials from home and in their own time, it is not possible to quantify the exact number of hours they spent practising IRSP.

[7] Dragon Naturally Speaking has an in-built Vocabulary. Words not included in the Vocabulary (specialised terms, neologisms, foreign words, names and so on) can be added to it and their pronunciation can be recorded by the user, to enable recognition. Moreover, it is possible to create shortcuts (“macros”) for frequently used phrases: for example, instead of dictating “the UN Security Council”, a respeaker may create a voice command such as “UNSEC”, which produces the transcription of the whole phrase.
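In essence, such a macro is a substitution table applied to the dictated text before it is displayed. The sketch below is purely illustrative: only the “UNSEC” command comes from this note, and the implementation is an assumption, not Dragon’s actual mechanism.

```python
# Illustrative macro table: a dictated voice command expands
# into the full phrase it stands for.
MACROS = {
    "UNSEC": "the UN Security Council",
}

def expand_macros(dictated: str) -> str:
    """Replace any dictated macro names with their stored expansions."""
    return " ".join(MACROS.get(word, word) for word in dictated.split())

print(expand_macros("today UNSEC met in New York"))
# today the UN Security Council met in New York
```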

[8] The transcripts include repetitions, self-corrections, truncated and mispronounced words, silent and filled pauses (only pauses of 2 seconds or longer were transcribed).

[9] Both the respoken text and the TL subtitles were included, as the comparison between the two can offer some insight into man-machine interaction (for example, it can help to distinguish an edition error from a recognition error caused by unclear articulation).

[10] The original speaker’s heavily accented pronunciation may have caused the error.

About the author(s)

Lecturer in English Language and Translation at UNINT in Rome. A conference interpreter by training (Trieste, English and Spanish), she taught at the universities of Hull, Bologna at Forlì and Trieste before joining UNINT in 2008. She teaches the Dialogue Interpreting (English-Italian) and Interlingual Respeaking modules on the MA in Interpreting and Translation, and the Subtitling and Audiodescription modules on the MA in Audiovisual and Multimedia Translation and Adaptation for Subtitling and Dubbing. Her research interests include Computer Assisted Interpreter Training (CAIT), corpus-based Interpreting Studies, Audiovisual Translation (dubbing, subtitling and respeaking), Legal Interpreting/Translation and Legal English. She has taken part in several international and national research projects, including: EPIC (European Parliament Interpreting Corpus) at the University of Bologna; 3 EU-funded projects on legal interpreting and translation (Building Mutual Trust, Qualitas, Understanding Justice); she created the FOOTiE (Football in Europe) corpus and coordinated the DubTalk/TVTalk project on dubbing and subtitling. She coordinates the English unit of the Eurolect Observatory and is a member of the LARIM research group on interpreting and of the GALMA observatory. She is currently International Co-investigator on the ESRC-funded Shaping Multilingual Access with Respeaking Technology project (led by the University of Surrey); she is also coordinating the “¡Sub! Localisation workflows that work” project (UNINT-Roehampton).


