On the importance of an encoding standard for corpus-based interpreting studies
Extending the TEI scheme
By Marco Cencini
Abstract & Keywords
This paper stresses the importance of an encoding standard in the compilation of interpreting corpora. It highlights some of the problems in compiling an interpreting corpus, showing the benefits that could derive from the use of the standard encoding scheme proposed by the Text Encoding Initiative (TEI: Burnard & Sperberg-McQueen 1994). As an example, it examines the way in which a Television Interpreting Corpus (TIC) (Cencini 2000) has extended the TEI scheme to cover interpreting data.
Questo articolo sottolinea l’importanza di uno standard di codifica per i corpora di interpretazioni. L’autore affronta alcuni dei problemi relativi alla creazione di un corpus di interpretazioni, mettendo in luce i benefici che possono derivare dall’uso dello schema di codifica proposto dalla Text Encoding Initiative (TEI: Burnard & Sperberg-McQueen 1994). A titolo di esempio, esamina il modo in cui un corpus di interpretazione per la televisione (TIC, Television Interpreting Corpus) ha esteso lo schema TEI ai dati di interpretazione.
Keywords: corpus linguistics, corpus-based translation studies, interpreting corpora, corpus annotation, television interpreting, Television Interpreting Corpus (TIC)
©inTRAlinea & Marco Cencini (2002).
"On the importance of an encoding standard for corpus-based interpreting studies"
inTRAlinea Special Issue: CULT2K
Edited by: Silvia Bernardini & Federico Zanettin
This article can be freely reproduced under Creative Commons License.
Permanent URL: http://www.intralinea.org/specials/article/1678
1. The past
1.1. The background
Since their introduction in the 1960s, corpora have proved helpful tools for both linguists and professionals. Despite the heated debates triggered by their use, important advances have been made thanks to the systematicity of the analyses they make possible; as Sinclair (1991: 1) points out, the observation of language corpora has helped linguists to discover features (i.e. the patterned nature of language) which simple speculation had undervalued, and corpora today represent an essential tool for linguists and lexicographers, as well as translators and other professionals.
From the 1960s onwards, corpora have gained importance to the extent that today large, general corpora are available for virtually all the major European languages (examples include the Corpus de Referencia del Español Actual — CREA — for Spanish, or the British National Corpus — BNC — for English). But besides general corpora, other types of corpora have been collected. For instance, in response to translators’ needs, comparable and parallel corpora have been developed to provide translators with original and/or translated material on a specific subject, offering lexical, grammatical or stylistic help. Today, the usefulness of such corpora is also endorsed by scholars as a means of overcoming the shortcomings of traditional translation tools such as bilingual dictionaries (see e.g. Williams 1996).
Despite the widespread use of corpora amongst translators (for both scientific and professional purposes), little attention has to date been devoted to corpora in the field of interpreting. Suffice it to say that, today, no extensive corpus of interpreting data is publicly available. Such a corpus would probably not be as useful as a comparable corpus for, say, lexical queries: on average, it would be less reliable than a comparable corpus in a specific field, since comparable corpora are more likely to contain material produced by experts in that field, which interpreters generally are not. Besides being generally more reliable from this perspective, comparable corpora are also easier to collect (see below for the problems involved in compiling interpreting corpora).
Yet, though a corpus of interpreting might not be as (immediately) useful for professional interpreters as a comparable corpus, it would be extremely interesting for scientific research purposes. As Pöchhacker (2002) points out, to date there are very few corpus-based studies of interpreting. This situation is partly explained by the fact that there is no interpreting corpus to turn to, and by the many questions which need to be tackled before embarking on the project of compiling a corpus of interpreting data.
Nevertheless, the results obtained from the introduction of corpora in different fields suggest that interpreting studies too might benefit from the creation of reference corpora in this area, that is, sets of widely available transcriptions of interpreter-mediated events which can be used and analysed by linguists and researchers to test and validate theories about interpreting. The first, immediate advantage of such corpora would be having a database readily analysable for research purposes without the need to transcribe data ad hoc. This would have two main positive consequences: on the one hand, it would reduce the time needed to carry out a study, thus, hopefully, helping to increase the number of studies on interpreting. On the other hand, studies could be based on greater quantities of data, with a reduction in the tendency to generalise from single cases.
As in any research practice, the final goal would be to raise awareness of what interpreting is and what processes (linguistic, pragmatic, practical or cognitive) are engaged during an interpretation. As Straniero Sergio (1999: 323) puts it,
it is only through the empirical observation of regularities of situations and behaviour that it is possible to create corpora, which, in turn, enable the determination of norms (in Gideon Toury’s sense), a major lacuna in the field of interpreting.
In the next section we will try to highlight some of the problems encountered in compiling an interpreting corpus and propose ways of solving them.
1.2. Interpreting corpora: problems 
Interpreting is, by definition, a spoken phenomenon, so any attempt to compile a corpus will first entail some form of transcription. Far from being a simple, undemanding activity, transcription poses important practical and theoretical questions which need to be faced before planning a corpus.
a. Recordings are hard to obtain
It is not easy to get recordings of real interpreting events, both because interpreters are often reluctant to be recorded, and because speeches at conferences and meetings are often treated as confidential material. A corpus builder will have to get permission to reproduce the texts s/he wants to include in the corpus.
b. Transcription is time-consuming
Even if recordings are available, their transcription is a very time-consuming practice, and this poses severe limitations on the quantity of data collectable by the single researcher. To give an idea of the time needed to transcribe a corpus we might say that, as a general rule, the time usually estimated for the transcription of a one-minute, multi-party conversation is one hour, a figure which may increase according to the number of features one wants to include in the transcription (see below). A reference corpus of interpreting is not therefore an easy project, requiring considerable investment.
c. Transcriptions are partial
Any transcription is (inevitably) a partial mirroring of an interaction, which cannot give an exhaustive representation of an event (Edwards 1995: 19). As a result, the feasibility of a study on interpreting depends on the features present in a transcription. If we are interested in a specific aspect but the transcriber has not included the relevant features, the transcription is useless. All the researcher can do is to go back to the original recording and add to the transcription so as to bring it into line with their requirements.
Unlike case studies, where transcriptions are generally targeted for a specific objective, a corpus will probably be used for a wide range of studies, potentially increasing the number of features to be included, or, alternatively, requiring ways to access the original audio/video recordings as well as the transcripts.
d. Transcription conventions are non-standardised
Even where the features transcribed are those of interest to the researcher, there can be considerable diversity in the ways in which they are represented in transcriptions. As Stig Johansson (1995: 83) has pointed out, differences in transcriptions are not so much due to the different features which are taken into consideration, but to the variety of symbols and conventions used to represent them. A corpus builder would have to find ways to tidy up the plethora of conventions used.
e. Data is not interchangeable
Finally, we must also consider the technical aspects of transcription regarding the methods by which an interpreting corpus is to be read, analysed and interchanged in an electronic format. Today, virtually everyone who needs to transcribe something does it on a computer, using a word processor. Word processors are extremely helpful tools for text production, but the number of programmes currently available makes it difficult for a corpus builder to choose from a panoply of different (and often conflicting) computer formats.
So far, interchange has not been perceived as a problem in interpreting studies, essentially because it has been very limited: researchers have produced and analysed their own transcriptions, and very few researchers have shared their transcriptions with anybody else. There is no single reference corpus to turn to if one wants to test a theory. So far, everyone has had to transcribe (and, only after that, use) their own texts. The creation of a reference corpus demands the adoption of a format accessible to any computer, regardless of the platform (Windows, Macintosh, Unix etc.) or the applications used.
f. Tools of analysis are limited
The use of corpora would also imply a change in the methodology used in interpreting studies. In this respect, the introduction of a reference corpus would have important consequences. Currently, interpreting studies dealing with transcripts are generally concerned with short, detailed transcriptions of single events. Transcriptions being short, the information needed for a study is easily retrieved by reading through them. Since the intended reader of the transcript is, generally, the researcher, decisions concerning the way in which to represent the transcript are mainly a matter of having a format which is as clear as possible for a human reader (see point (d) above).
In studies of this kind, moreover, the transcriber of the data is generally the same person as the one who actually analyses them; even where this is not the case, the data analysed are generally of limited extent. For this reason, it is possible to carry out the analysis by simply reading the transcripts. The study is carried out on known data and the analysis can, to some extent, be driven by that knowledge.
Corpus-based studies start from a different premise. The researcher has to analyse a phenomenon without knowing in detail the characteristics of the data at hand. The quantity of data does not allow theories to be tested by reading, and therefore there is the need to approach the data through computer queries to interrogate the corpus. This means that the corpus needs to have a data retrieval system which can locate selected phenomena of interest. The analysis is thus carried out on unknown data and is driven by the results of computer-interfaced queries rather than by direct interaction with the transcriptions.
Summing up, the problems listed in the last section pose interesting questions that need to be solved if we are to create corpora of interpreting data that can serve as common instruments for research. In particular, there is an evident need to establish a standard format which does away with all the various ad hoc conventions adopted in the literature; for the corpus to be interchangeable, this format needs to be platform- and application-independent and, at the same time, to ensure the possibility of mechanically locating features of interest. Besides careful planning of the features to be included in a transcription, the partiality of any transcription should be recognised, and ways devised to interface the original audio/video recordings with the transcripts.
2. The present
The problems listed in points (a) to (f) of § 1.2 are inherent to any interpreting corpus but, apart from (a), are also shared by spoken corpora in general. After all, an interpreting corpus is first of all a spoken corpus, and there are already spoken corpora available which have dealt with at least some of the problems listed. We can therefore first look at how similar problems have been tackled elsewhere.
As noted above, one possible way to overcome the problems involved in compiling a spoken corpus is the adoption of an encoding scheme. In this respect, the most interesting proposal is certainly that of the Text Encoding Initiative (TEI) (Burnard & Sperberg-McQueen 1994), adopted (among others) by the British National Corpus. TEI is already set to become a norm for many academic communities, and its adoption could help answer the questions posed by the compilation of an interpreting corpus.
In the next two sections we will try to highlight the most interesting features of the Television Interpreting Corpus (TIC: Cencini 2000), a TV interpreting corpus which tries to adopt the TEI proposal to encode interpreting data. After looking at its technical characteristics, we will then move on to show the way in which the TIC overcomes some of the shortcomings listed above and the benefits deriving from the encoding scheme used.
2.1 The Television Interpreting Corpus
The Television Interpreting Corpus is the result of an attempt to adapt the TEI encoding scheme for interpreting corpora. It is a pilot corpus consisting of transcriptions of four hours of TV broadcasts involving an interpreter. In its current version, it is made up of 11 files amounting to approximately 40,000 words of speech. While only a small corpus, it has provided a useful means to verify the adaptability of TEI to interpreting data, and to identify its potential limits and advantages.
Most attention has been paid to features which appear specific to interpreting events, rather than to ones also present in other types of spoken texts, such as pausing and prosody, whose encoding in TEI has already been tackled elsewhere. At present, the mark-up provides information about
- the participants in the interaction
the identity of speakers, together with information concerning their mother tongue and their roles in the interaction (guests, interviewers, interpreters, etc.);
- the context of the interaction
information concerning the setting, the time and the nature of the broadcast;
- correspondences between utterances
semantic relationships between primary participants’ and interpreters’ utterances, showing what is a translation of what;
- the position of the interpreter
in TV interpreting, the position of the interpreter with respect to the public at home may be on-screen (therefore visible for viewers) or off-screen (with only the interpreter’s voice being audible);
- the interpreting mode
information concerning the mode chosen for the interpretation, i.e. simultaneous, consecutive or chuchotage;
- overlapping speech
in simultaneous and chuchotage interpreting we can distinguish two kinds of overlap; in Cynthia Roy’s words (1996: 56),
one kind of overlap is constant — when interpreters begin interpreting several seconds after a primary speaker has begun. This kind of simultaneous talk of speaker and interpreter, which can also be seen or heard by the two speakers, is a marker of the unusual nature of an interpreting event. This interlingual overlap becomes an accepted norm of these face-to-face encounters […]. There is, however, another kind of overlap […] which occurs between the two primary speakers. This overlap can be easily understood given that turns can be self-generated and/or two participants engage in simultaneous talk.
The TIC provides mark-up for both types.
- non-verbal features
such as laughter, filled pauses, etc.
For further technical details regarding the encoding of these features see Cencini (2000) and Cencini & Aston (2002).
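By way of illustration, the sketch below shows how an interpreter’s rendition might be linked to the primary speaker’s utterance it translates, and how such correspondences can then be recovered mechanically. The element and attribute names (<u>, who, xml:id, corresp) follow general TEI practice for transcribed speech, but the fragment itself, its speaker codes and its linking convention are hypothetical, not the TIC’s actual mark-up.

```python
# Hypothetical TEI-style fragment: an interpreter's utterance (u2) points
# back, via corresp, to the primary speaker's utterance it renders (u1).
# Speaker codes and attribute values are illustrative only.
import xml.etree.ElementTree as ET

fragment = """
<div type="interview">
  <u xml:id="u1" who="GUEST">Thank you for inviting me.</u>
  <u xml:id="u2" who="INTERP" corresp="#u1">Grazie per l'invito.</u>
</div>
"""

root = ET.fromstring(fragment)
# The xml: prefix is predeclared in XML, so xml:id is parsed under this URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Map each utterance's id to its text, then pair every interpreter
# rendition with the original it translates via the corresp attribute.
originals = {u.get(XML_ID): u.text for u in root.iter("u")}
pairs = [(u.text, originals[u.get("corresp").lstrip("#")])
         for u in root.iter("u") if u.get("corresp")]
print(pairs)
```

Because the correspondence is expressed as explicit mark-up rather than typographic convention, it survives interchange between platforms and can be queried by any XML-aware tool.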
2.2 Summarising the solutions
2.2.1. TV interpreting
As for the interpreting genre analysed, TV interpreting was chosen for this pilot study because it appears to be a growing and increasingly popular field of work for interpreters in Italy. Thanks to the frequency with which foreign guests are invited to appear on television, some interpreters are becoming popular TV personalities. This popularity also makes television interpreting fairly readily available as data, in contrast with the scarcity of recordings available from conference settings. Moreover, TV interpreting offers a wider variety of aspects and features to encode than conference settings, making it a more interesting genre on which to test the TEI scheme.
2.2.2. TEI is platform- and application-independent
Any text encoded in TEI is accessible from any computer, regardless of the text editor one is using. Technically speaking, the TEI scheme exploits the Extensible Markup Language (XML), a language that will soon be widely employed for the creation of web sites; for this reason, TEI/XML documents are also accessible from web browsers, which also offer interesting formatting advantages (see § 2.2.3.).
2.2.3. TEI is a standard format
TEI does away with non-standardised transcription conventions, converting them into machine-friendly tags; yet, far from being machine-oriented, TEI documents can easily be displayed in reader-friendly formats. In fact, thanks to XML technology, researchers can re-format transcriptions with style sheets so as to display them in a reader-friendly fashion, using the conventions they are most familiar with. XML compatibility thus combines the advantages of a machine-friendly format for data retrieval with those of easy reading for human users. For further information on the use of XML style sheets, see Cencini (2000) and Cencini & Aston (2002).
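As a rough illustration of this principle (the TIC itself relies on XML style sheets rather than scripts), the sketch below re-renders a hypothetical encoded fragment using one possible reader-friendly convention; the speaker codes are invented for the example.

```python
# The same encoded transcript can be re-rendered under whatever display
# convention a reader prefers; here, "SPEAKER: <tab> text" lines.
# (Illustrative fragment and codes only, not the TIC's own mark-up.)
import xml.etree.ElementTree as ET

fragment = """
<div>
  <u who="INTERVIEWER">And what brings you to Italy?</u>
  <u who="GUEST">A film, mostly.</u>
</div>
"""

root = ET.fromstring(fragment)

# Render each utterance as a conventional transcript line.
lines = [f"{u.get('who')}:\t{u.text}" for u in root.iter("u")]
print("\n".join(lines))
```

A different style sheet (or a different rendering function) could display the same underlying data in columns, with overlap brackets, or in any other convention, without touching the encoded source.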
2.2.4. Transcription is still time consuming, but…
The adoption of an encoding standard certainly does not make transcription a less time-consuming practice; yet the availability of a reference corpus endows researchers with data to analyse without the need to transcribe recordings ad hoc.
2.2.5. Tools for analysis
Provided that they are appropriately encoded, TEI documents allow researchers to retrieve features of interest from the corpus mechanically. As for the programmes available for this purpose, SARA, the software originally developed for the BNC, is to be released in a version which can work on any (SGML or XML) corpus. This programme makes it possible to carry out not only lexical queries but also, in the case of the TIC, searches involving any of the features listed in § 2.1.
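The sketch below illustrates the kind of query this makes possible. The role and mode attributes are a hypothetical encoding adopted here for illustration, not the TIC’s actual tags, and the retrieval is done with a few lines of Python rather than through SARA.

```python
# Mechanical feature retrieval from an encoded transcript: find every
# utterance produced by an interpreter working in simultaneous mode.
# (Hypothetical "role" and "mode" attributes, for illustration only.)
import xml.etree.ElementTree as ET

corpus = """
<text>
  <u who="P1" role="guest">Good evening.</u>
  <u who="I1" role="interpreter" mode="simultaneous">Buonasera.</u>
  <u who="I1" role="interpreter" mode="consecutive">Sono felice di essere qui.</u>
</text>
"""

root = ET.fromstring(corpus)

# A query over mark-up, not over surface strings: no reading-through needed.
hits = [u.text for u in root.iter("u")
        if u.get("role") == "interpreter" and u.get("mode") == "simultaneous"]
print(hits)
```

The same mechanism scales to any encoded feature (interpreter position, overlap, broadcast setting), which is precisely what distinguishes corpus queries from reading through transcripts by hand.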
3. The future
3.1. Hypertextual transcriptions and more analytical tools
In its present version, the TIC includes only the transcriptions of the original TV broadcasts, so if the features included in the transcriptions are not sufficient for a specific study, the transcriptions are useless. Yet XML compatibility makes it technically feasible to link the transcripts to digitised audio, allowing the researcher to hear the original recording while viewing the corresponding transcript. In the future, it would therefore be possible to have hypertextual transcriptions linked to the original recordings, which could offer researchers more detailed data even where the transcripts are not sufficient. Furthermore, this possibility would reduce the effects of particular transcription biases, making the analysis less dependent on them (Chafe 1995: 54).
Apart from this possibility, a major requirement for (future) corpus-based research into interpreting is the development and improvement of retrieval tools. This would include parallel concordancers which would make it possible to look for a word and, at the same time, all the renditions proposed by an interpreter of that word, and, possibly, audio concordancers, allowing researchers to retrieve not only the bit of text where a specific keyword appears, but also the corresponding section of video/audio.
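A toy version of such a parallel concordancer can be sketched as follows. The corresp-based alignment of renditions to originals is a hypothetical encoding reused from the earlier illustration, not the TIC’s actual scheme, and a real tool would of course also handle tokenisation, context windows and audio links.

```python
# Toy parallel concordance: look up a word in primary speakers'
# utterances and retrieve the interpreter's rendition aligned to each
# hit. (Hypothetical corresp-based alignment, for illustration only.)
import xml.etree.ElementTree as ET

corpus = """
<text>
  <u xml:id="o1" who="GUEST">The film was a challenge.</u>
  <u who="INTERP" corresp="#o1">Il film è stato una sfida.</u>
  <u xml:id="o2" who="GUEST">A wonderful challenge.</u>
  <u who="INTERP" corresp="#o2">Una sfida meravigliosa.</u>
</text>
"""

root = ET.fromstring(corpus)
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Index each rendition by the id of the original it translates.
renditions = {u.get("corresp").lstrip("#"): u.text
              for u in root.iter("u") if u.get("corresp")}

def parallel_concordance(keyword):
    """Return (original, rendition) pairs for originals containing keyword."""
    return [(u.text, renditions.get(u.get(XML_ID)))
            for u in root.iter("u")
            if u.get(XML_ID) and keyword in u.text.lower()]

print(parallel_concordance("challenge"))
```

Each hit returns both the source utterance and the interpreter’s rendition, so recurrent translation solutions for a given word become directly observable across the corpus.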
The TIC provides only a tentative set of proposals for encoding interpreting data, with the aim of fostering a different approach to interpreting studies. Clearly it would be presumptuous to imagine that it has solved all the problems involved. It is only a small corpus, and it would be desirable to see other corpora compiled in order to stimulate debate and to correct mark-up solutions which prove inappropriate or difficult to handle; similarly, it would be useful to enrich the mark-up with prosodic and part-of-speech tagging. However, TEI seems sufficiently flexible and extensible to allow for the encoding of virtually any feature of interpreting texts. The goal would be to move from a (single) pilot corpus to a larger reference corpus; this also implies a shift from TV interpreting to other settings such as conferences, courtrooms, negotiations, etc. Provided the setting is adequately encoded in each case, TEI endows researchers with the tools necessary to carry out contrastive searches so as to compare interpreting in different contexts.
Were these developments to be introduced, they could result in more numerous and more accurate studies of interpreting. Their goal would be an increased awareness of what interpreting involves, providing teachers and students of interpreting with the means to teach, learn and, finally, work better as interpreters.
References
Burnard, L. & C.M. Sperberg-McQueen, eds. (1994). Guidelines for electronic text encoding and interchange. Chicago/Oxford: ACH, ACL, ALLC.
Cencini, M. (2000). Il Television Interpreting Corpus (TIC). Proposta di codifica conforme alle norme TEI per trascrizioni di eventi di interpretazione in televisione. Forlì: SSLMIT. Unpublished dissertation.
Cencini, M. & G. Aston (2002). “Resurrecting the corp(us|se): towards an encoding standard for interpreting data”. In G. Garzone & M. Viezzi (eds.), Interpreting in the 21st century. Amsterdam & Philadelphia: John Benjamins, 47-62.
Chafe, W. (1995). “Adequacy, user-friendliness, and practicality in transcribing”. In Leech G., G. Myers & J. Thomas eds., Spoken English on computer: transcription, mark-up and application. London: Longman, 54-61.
Edwards, J. (1995). “Principles and alternative systems in the transcription, coding and mark-up of spoken discourse”. In Leech G., G. Myers & J. Thomas eds., Spoken English on computer: transcription, mark-up and application, London: Longman. 19-34.
Johansson, S. (1995). “The approach of the Text Encoding Initiative to the encoding of spoken discourse”. Leech G., G. Myers & J. Thomas eds., Spoken English on computer: transcription, mark-up and application. London: Longman. 82-98.
Pöchhacker, F. (2002). “Researching interpreting quality: models and methods”. In G. Garzone & M. Viezzi (eds.), Interpreting in the 21st century. Amsterdam & Philadelphia: John Benjamins, 95-106.
Roy, C. (1996). “An interactional sociolinguistic analysis of turn-taking in an interpreted event”. Interpreting, 1: 39-67.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Straniero Sergio, F. (1999). “The interpreter on the (talk) show. Interaction and participation frameworks”. The translator, 5: 303-326.
Williams, I.A. (1996). “A translator’s reference needs. Dictionaries or parallel texts”. Target 8:2, 227-299.
Notes
I warmly thank Guy Aston for reading the first draft of this paper. His kind, precious advice is a constant model of clarity and insight. Any error or inconsistency in the present paper is to be ascribed to the author.
 See also Cencini & Aston (2002) for further details.
 I am grateful to Lou Burnard for his help in extending the TEI specification to cover interpreting data.