Using a Source Language Corpus in Translator Training

A resource for teaching and testing activities

By Stephen Coffey (Università di Pisa)

Abstract & Keywords


This paper suggests ways in which a corpus of the source language (SL) may prove to be of use in translator training. Two distinct uses are identified. Firstly, I briefly discuss the notion of the SL corpus functioning as a translation aid; secondly, I look in more depth at how the SL corpus may function as a source of teaching and testing materials. Here, a distinction is made between the extraction of longer sections of text which will function as complete passages to be translated, and the extraction of examples of specific SL language features which the teacher wishes to focus on.


This article deals with the use of corpora in the training of future translators. Studies carried out so far in this field mainly consider bilingual corpora, both comparable and parallel, and monolingual target-language corpora. The article sets out the reasons why source-language corpora should also play an important role in the teaching of translation, above all as a source of authentic material forming the basis of exercises on those textual phenomena that might cause particular problems for the inexperienced translator. It shows how one can move from the identification of a potential problem to the creation of teaching material, illustrating this step with a concrete example concerning translation from Italian into English.

Keywords: corpus linguistics, translation training, translator and interpreter training, testing

©inTRAlinea & Stephen Coffey (2002).
"Using a Source Language Corpus in Translator Training"
inTRAlinea Special Issue: CULT2K
Edited by: Silvia Bernardini & Federico Zanettin
This article can be freely reproduced under Creative Commons License.
Permanent URL:

1. Introduction: From ‘compare-able’ corpora to monolingual corpora

Most discussion relating to the use of corpora in translation studies concerns itself with the comparison of different, but relatable, texts or sets of texts. Some of the more important areas of comparison are the relationships between: (i) source texts and their renderings in other languages, (ii) comparable but independently created texts in different languages, (iii) authentic texts in a given language and comparable translated texts in the same language, and (iv) different translations in the same language originating from a single source text. The many areas of intertextual comparison are accompanied by a corresponding diversity of compare-able corpus types of one sort or another. I use the word form ‘compare-able’ to differentiate it from the more specific term ‘comparable corpora’, which normally refers to relationship (ii) above. Some studies dealing wholly or especially with compare-able corpora in translator training are Aston (1999), Gavioli (1999), Gavioli / Zanettin (2000), and Zanettin (1994, 1998, 1999).

In the midst of this rich and extremely positive interest in compare-able corpora, it should not be forgotten that typologically simple, monolingual corpora can also have an important role to play, above all in certain areas of applied translation studies. Here, we are sometimes more interested in the procedure and process of translating than in the already completed product of translation. Since the act of translating necessarily involves a temporal dimension, we may choose to concentrate our attention on specific steps in the translation process. Thus, a corpus of the source language may be of interest when we are focussing on the yet-to-be-translated source text, and a target language (TL) corpus may be of interest when the target text is being composed. This is true as regards both the world of the professional translator (with reference especially to translation aids) and the field of translator training.

The use of the TL corpus, in particular, has been discussed by Bowker with reference to both professional translators (Bowker, 2000) and the training of translators (Bowker, 1998). Stewart (2000) also discusses aspects of the use of the TL corpus in translating, as do Friedbichler & Friedbichler (2000).

In the present paper, I will focus, by contrast, on the source language (SL) corpus, suggesting some ways in which it might be usefully incorporated into a translator training programme. It should be noted that these are no more than proposals, since I am not reporting on procedures which have been systematically tried and tested. I will discuss two very different uses of the SL corpus in translator training. The first, in Section 2, I only touch upon; the second, in Section 3, I describe in more depth.

2. Learning how to use an SL corpus as a translation aid

One general application of SL corpora is that of giving the student translator the opportunity to learn how an SL corpus may function as a translation aid. There is nothing unique to an SL corpus in this regard: all types of translationally relevant corpora should be used in translator training in order to teach students how corpora may prove useful to them in their future professional lives, and to give them as much relevant practice as possible. I should add that an SL corpus is by no means the most useful corpus as a translation aid: comparable, translation (parallel), and TL corpora all probably have more to offer.

As is the case with other corpus types, an SL corpus is a potential source of both linguistic and encyclopaedic information. In the former case, the general use to which the corpus may be put is that of providing further information regarding aspects of the source text, notably in cases where the translator wishes to check whether a particular item or aspect of language is typical of the language or text type, or whether it is marked in some way. This will be useful above all if the SL is not the translator’s mother tongue. With regard to encyclopaedic information, I will limit myself to pointing out that a TL corpus could, in principle, provide exactly the same information as the SL corpus. Indeed, where such corpus usage is concerned, it might be better to talk in terms of subject-matter corpora rather than language-specific (SL or TL) corpora.

3. SL Corpora as sources of teaching and testing materials

A source language corpus comes into its own when considered as a collection of texts which the teacher of translation may exploit in order to create teaching and testing materials. This approach to using a corpus in translator training differs from that of most studies in that it focusses, in the first instance, on what the teacher can do with the corpus rather than on what the learner can do. Learner-centred approaches usually revolve around a direct relationship between the learner and the corpus, and involve activities in which trainees are given hands-on experience with corpora in order to improve their knowledge and skills in a number of different areas (see, for example, Bernardini 2000). While I fully endorse the importance of this approach, I also believe that the teacher can at times act as a filter between the corpus and the students.

Within the context of translator training, and from the point of view of the teacher, an SL corpus may be considered as a repository of samples of language - of text types, and of language features and components - which may be presented to students for translation, either to provide practice, or to provide feedback on learning. The all-important restricting feature for classroom applications, and one which will be discussed below, is that the corpus must be structured and/or tagged in such a way as to allow the teacher to retrieve samples of text relevant to what is being taught or tested. The samples of text extracted by the teacher may either be more or less complete passages or much shorter extracts. The former could be used especially for the teaching or testing of macro-skills and the latter for teaching linguistic micro-skills. However, when a particular micro-skill relates to textual features rather than, say, those of a grammatical or lexical nature, fairly long stretches of text may well be needed.

3.1 An SL corpus as a source of passages to translate

An SL corpus can be exploited, quite simply, as a source of passages for translation. In this respect, it can be likened to traditional printed paper sources, but with the advantage that the desired text is already in electronic form. The latter fact means that a given text may easily be adapted as regards length, format and content. Assuming that the internal organization of the corpus allows one to do so, one will be able to make specifications regarding the nature of the text, for example, domain, text type, and date of composition. The texts retrieved may then be used within a translation class.
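By way of illustration, the kind of selection just described might be sketched in Python as follows. This is only a sketch under stated assumptions: the record structure (`CorpusText`), its field names, the function `select_texts` and the sample records are all invented for the example and do not describe any particular corpus or retrieval software.

```python
from dataclasses import dataclass


@dataclass
class CorpusText:
    """One corpus text with hypothetical header metadata (assumed fields)."""
    title: str
    domain: str
    text_type: str
    year: int
    body: str


def select_texts(texts, domain=None, text_type=None, year_from=None, year_to=None):
    """Return the texts that satisfy every criterion that was supplied."""
    selected = []
    for text in texts:
        if domain is not None and text.domain != domain:
            continue
        if text_type is not None and text.text_type != text_type:
            continue
        if year_from is not None and text.year < year_from:
            continue
        if year_to is not None and text.year > year_to:
            continue
        selected.append(text)
    return selected


# Invented sample records, for illustration only.
sample_corpus = [
    CorpusText("A", "law", "textbook", 1988, "..."),
    CorpusText("B", "politics", "magazine article", 1990, "..."),
    CorpusText("C", "law", "newspaper article", 1991, "..."),
]

# Texts in the domain "law" composed in 1990 or later.
law_since_1990 = select_texts(sample_corpus, domain="law", year_from=1990)
print([text.title for text in law_since_1990])
```

Such a filter is only as good as the metadata recorded at the time of corpus compilation; where texts are not labelled for domain, text type or date, no selection of this kind is possible.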

I should stress that I am not advocating the use of SL texts for this purpose all the time. Very often, for example, it will be useful to work from a parallel (translation) corpus in order to compare student production with the work of experienced professionals. However, it should be remembered that a given TL text within a parallel corpus is only one of many possible translations, not a unique yardstick against which to judge the students’ work.

I should also point out, however, that there is sometimes a drawback in using texts already stored as part of an electronic corpus. This is the fact that visual aspects of the original will almost certainly have been lost at the time of corpus compilation. Layout, typographical features, and relationships between text and graphics, may all have been important parts of the original SL text but not have been carried through to the electronic corpus.

3.2 The linguistic micro-skills of translation

There are a number of different possible reasons for asking student translators to engage in the act of translation. In the following discussion I make reference to three such reasons: (i) as part of the methodology of teaching a foreign language and raising awareness of interlingual differences, [1] (ii) to help overcome problems of interference between SL and TL when translating, and (iii) to gain experience in dealing with cases of lack of correspondence between languages. I shall discuss the first two of these together since it is not always possible to distinguish between them in actual classroom practice. For example, where translation into the foreign language is concerned, a given activity may at the same time serve both to help weaker students increase their familiarization with a particular feature in the foreign language, and to provide more capable students with practice in overcoming possible SL→TL translation interference. A related reason for considering the two together is that there will never be an exact moment in which an individual student masters an L2 feature, at least as regards use of the L2 in isolation from the L1, and has ‘only’ the problem of transfer to deal with when actually engaged in the process of translating.

3.2.1 Foreign language learning and combatting SL influence

There are many different areas where it might be useful to use translation as one of the methodologies to foster L2 learning and to help overcome SL→TL interference. Exactly which areas will vary according to the language-pair concerned. Some examples are the following:

-  the probable lack of part-of-speech (POS) equivalence between SL lexical items and their TL counterparts. Although this is a fairly common difference, and one which should cause little difficulty for the experienced translator, it is surprising how often students create unnaturalness by trying to maintain POS equivalence, perhaps encouraged by the practices of most, if not all, bilingual dictionaries.

-  combatting the tendency to create one-to-one relationships between specific SL and TL lexical items.

-  differences in typical sentence length.

-  differences relating to discourse structure and rhetoric, for example the way in which an argument is typically developed in expository text. [2]

How exactly extracts from the SL corpus are used will depend to a considerable extent on individual teaching methodology and on whether students are required to translate out of, as well as into, their mother tongue.

With regard to this use of the corpus in translator training, ideally the teacher would have access to bilingual corpora, both parallel (translation) and comparable, which could either be used as such, or else split into SL and TL corpora according to which phase of the teaching process one is in. The expository phase of teaching, for example, may best be served by a bilingual corpus. During the practice phase, however, in the absence of a bilingual corpus, the SL corpus could certainly be put to very good use.

3.2.2 Learning to handle lack of correspondence between language pairs

Sometimes it is not enough to know how one typically writes in both languages of a given language pair and therefore in what different ways an SL string may be expressed in another language. There may be significant mis-matches between languages such that features of content or style cannot easily be expressed in the other language. Some of the possible problem areas in this respect are:

-  culture-bound items of lexis or phraseology: the referent exists in the SL culture but not in the TL culture.

-  language-specific items of lexis or phraseology: while the referent exists in both cultures, the SL alone refers to it directly through a specific language component.

-  wordplay (here, we have potential, rather than actual, interlingual difference: some instances of wordplay may in fact turn out to be directly and easily translatable) [3].

In cases such as these, I would suggest that it will very often be preferable to use an SL corpus, even where a parallel corpus is available. By using an SL corpus the teacher can create a learning environment in which the trainee translator may more usefully use, and therefore further develop, his or her own ability and creativity to solve interlingual problems. The availability of bilingual corpus resources, or even the knowledge that such resources may be consulted afterwards in order to discover the ‘correct’ solution, may impede the use and development of imaginative skills [4].

3.2.3 Linguistic micro-skills: the retrieval and exploitation of material

In the previous two sections, I have listed a number of potential foreign language learning and translation difficulties. In order to take advantage of an SL corpus with regard to such problems, it is of course imperative that the teacher be in a position to extract relevant materials from the corpus. The major steps which must be taken in order to use an SL corpus as a source of teaching material are the following:

1)  identify a particular feature which represents a potential interlingual problem for the trainee translator.

2)  consider whether the corpus at one’s disposal will contain examples of this feature, and whether it is annotated or structured in such a way as to allow examples of the feature to be retrieved without too much difficulty or redundancy using the software available.

3)  retrieve relevant examples, editing out irrelevant findings where necessary.

4)  choose suitable examples and expand the contexts if they are too short for the required purposes (this will almost always be the case if one-line keyword-in-context (kwic) concordancing is being used).

5)  construct suitable exercises.

A considerable amount of work is necessary to produce good corpus-based teaching materials, but the same may be said of any teaching materials, and once such materials have proved their worth, they may be used time and time again.

With regard to the retrieval of relevant samples from the corpus, in some cases it may be a relatively simple task. This will be the case, for example, if the translation problem being tackled is relatable directly to text type and if examples of text types within the corpus are labelled as such. Another simple case is that of problems which are strictly identifiable with particular lexical items. Since the most basic operation of the electronic corpus is that of identifying and sorting individual word forms, all the teacher will have to do is choose specific examples of the problem and search the corpus for the corresponding word. Interlingual differences which are very common may also prove very easy to locate. For example, with regard to differences in typical sentence length, the teacher may be able to find suitable SL extracts by simply browsing through texts of a particular type (rather than by writing a program which specifies a minimum or maximum number of orthographic words between full stops).
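For readers who do wish to take the programming route, a program of the kind mentioned in parentheses above might be sketched in Python as follows. This is a minimal sketch, not a tested tool: the function names are illustrative, and the sentence splitter is deliberately naive (it breaks after a full stop, exclamation mark or question mark followed by white space), so that abbreviations such as "art." or "tel." will produce false sentence boundaries.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by white space."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def sentences_by_length(text, min_words=0, max_words=None):
    """Return the sentences whose orthographic word count lies in the given range."""
    result = []
    for sentence in split_sentences(text):
        n_words = len(sentence.split())  # words = strings separated by white space
        if n_words < min_words:
            continue
        if max_words is not None and n_words > max_words:
            continue
        result.append(sentence)
    return result


# Invented sample text, for illustration only.
sample = "Short one. This sentence is a little bit longer than the first. End."
print(sentences_by_length(sample, min_words=5))
```

In practice the thresholds would need to be set with the language pair in mind, and the output would still have to be checked by hand before being used in class.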

Problems of a syntactic nature will often necessitate a grammatically tagged corpus, though it will also sometimes be possible to approach this through individual word forms. Examples of wordplay could be looked for through very particular text types (e.g. advertising, headlines, and humorous writing). Without reference to such text types, single-word punning would be very hard to locate, unless of course individual instances were tagged as such in the corpus. Phrasal exploitation, on the other hand, can be located by making searches for examples of typically exploitable phrase types. Corpus-based research suggests, for instance, that many idioms and proverbs are subject to exploitation, and examples can be located by looking for one or more key words from within a given phrase - see, for instance, Cignoni & Coffey (1998) and Moon (1998: 50-51, 170-4), as well as Kenny’s (2001: 127-140) suggestions for locating corpus examples of single-word and collocational creativity.

3.2.4 Linguistic micro-skills: a sample exercise

There now follows an example of what an SL-corpus-based classroom activity might look like, together with indications of how the exercise was put together. It deals with one type of problematic interlingual area, that of culture-bound items. The aim of the exercise is to get students to use their own resources (knowledge and imagination) in order to find different possible translations for one particular item. The exercise involves translating from Italian into English, and is aimed at either English mother-tongue students (TL=L1) or Italian students with a good command of English (TL=L2). The culture-bound item is the Italian term servizio civile, a non-military alternative to servizio militare (military service).

The contexts which form the basis of the exercise come from a corpus located at the National Research Council’s Institute of Computational Linguistics in Pisa. The corpus in question is an untagged corpus of contemporary, written Italian, known simply as the ‘Italian Reference Corpus’ (IRC). At the time of consultation, the IRC consisted of approximately 16,000,000 words. Newspaper and magazine articles accounted for about two thirds of the total volume, and the remainder consisted of works of fiction and of non-fiction, the latter including technical research reports [5].

In order to find examples of the phrase servizio civile, the separate words servizio and civile were looked for. A word family was created which allowed for a maximum of 4 intervening words and in which the order of the two words was left unspecified. Although these flexible search parameters may seem excessive, experience has taught me that it is almost always advisable with Italian multiword units to allow for some degree of phrasal variability, though how much will depend upon the probable trade-off between precision and recall [6]. More rigid search parameters would not, for example, have located the following token: “piuttosto che al servizio militare o civile” [emphasis added].
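The logic of a search of this kind (two words, in either order, separated by at most a given number of intervening words) can be sketched in Python as follows. It should be stressed that this is an illustrative reconstruction and not the software actually used to interrogate the IRC; the function names and the simple tokenizer are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens (apostrophes split words)."""
    return re.findall(r"\w+", text.lower())


def proximity_hits(tokens, word_a, word_b, max_gap=4):
    """Return index pairs (i, j), i < j, where word_a and word_b co-occur in
    either order with at most max_gap tokens between them."""
    targets = {word_a, word_b}
    hits = []
    for i, token in enumerate(tokens):
        if token not in targets:
            continue
        other = word_b if token == word_a else word_a
        # j may lie at most max_gap + 1 positions to the right of i.
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            if tokens[j] == other:
                hits.append((i, j))
    return hits


example = "piuttosto che al servizio militare o civile"
tokens = tokenize(example)
print(proximity_hits(tokens, "servizio", "civile"))             # pair found, two words apart
print(proximity_hits(tokens, "servizio", "civile", max_gap=0))  # adjacent words only: no hit
```

The second call shows why a rigid search for the adjacent sequence would have missed this token, while the more flexible setting finds it at the cost of occasionally retrieving irrelevant co-occurrences.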

Contexts of about 125 words were automatically retrieved for each token at the time of corpus interrogation.
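How a context of roughly this size might be cut out around a hit can be sketched in Python as follows. The sketch assumes, purely for the sake of illustration, that the context is centred on the key word and measured in white-space-separated words; how the software used for the IRC actually delimits its contexts is not specified here, and the function name is hypothetical.

```python
def context_window(words, hit_index, total_words=125):
    """Return about total_words words centred on words[hit_index].

    The window is shorter when the hit lies near the start or end of the text.
    """
    half = total_words // 2
    start = max(0, hit_index - half)
    end = min(len(words), hit_index + half + 1)
    return " ".join(words[start:end])


# Invented text of 300 dummy words, for illustration only.
dummy_words = [f"w{i}" for i in range(300)]
middle_context = context_window(dummy_words, 150)
print(len(middle_context.split()))
```

A window of this size is usually long enough for the teacher to judge whether a token is suitable, and it can then be trimmed by hand to the length required for the exercise.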

The search produced 19 hits, of which only one was irrelevant to the phrase being looked for. Six tokens were chosen to form the basis of an exercise. The decision as to which tokens to include was based on the desirability of having a variety of situational and linguistic contexts, so as to increase the probability of students’ finding different, but appropriate, ways of handling the key phrase. The contexts retained were reduced slightly in length.
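The figures just given allow the precision of the search to be worked out: 18 relevant hits out of 19 retrieved. The short Python sketch below makes the arithmetic explicit, using the standard definitions of precision and recall; the function names are illustrative. Recall cannot be calculated for this search, since the total number of relevant tokens in the corpus (including any that the search missed) is not known.

```python
def precision(relevant_retrieved, total_retrieved):
    """Proportion of the retrieved items that are relevant."""
    if total_retrieved == 0:
        raise ValueError("no items were retrieved")
    return relevant_retrieved / total_retrieved


def recall(relevant_retrieved, total_relevant):
    """Proportion of all relevant items in the corpus that were retrieved."""
    if total_relevant == 0:
        raise ValueError("the corpus contains no relevant items")
    return relevant_retrieved / total_relevant


# 19 hits, of which one was irrelevant, i.e. 18 relevant hits.
search_precision = precision(18, 19)
print(round(search_precision, 3))  # about 0.947
```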

In the following exercise, students are asked to work in small groups in order to promote discussion and stimulate use of the imagination. English mother tongue students are allowed to use a monolingual Italian dictionary to help them with any unknown lexical items. At the end of the activity, class discussion ensues regarding the various suggestions made.

Example of a possible classroom activity

Below you will find 5 corpus extracts containing 6 tokens of the phrase servizio civile [7]. In groups of three, decide how you would best translate the parts of the text in bold type. Try to find at least 3 different ways to deal with the phrase servizio civile.

1) La difesa della patria (che è definita come “sacro dovere del cittadino”).  L’adempimento di tale dovere non deve pregiudicare “la posizione di lavoro del cittadino, né l’esercizio dei diritti politici”(per esempio del diritto di voto ) (art. 53 c. 1 e 2). A partire dal 1972 (legge n. 772) è stato riconosciuto agli obiettori di coscienza il diritto di effettuare un servizio civile in sostituzione del servizio militare.- Il dovere di pagare i tributi “in ragione della capacità contributiva” di ciascuno (art. 53 c. 1). La Costituzione si preoccupa di impedire possibili abusi da parte della pubblica amministrazione nell’imporre tali obblighi ai cittadini ........... [extract taken from an academic text-book]

2) ... utilissima è “Come evitare il servizio militare senza eludere il proprio dovere” un libro che viene spedito contrassegno a chi ne fa richiesta presso: Aprile Ronda Editore, Salita di Riva 3, 13051, Biella, tel. 015/21960. Qui sono contenute tra l’altro tutte le disposizioni di legge per l’obiezione di coscienza. Tutti possono sostituire i 12 mesi di servizio militare con 20 mesi di servizio civile, basta che non abbiano mai avuto licenze o autorizzazioni per il possesso di armi né, tantomeno, condanne per detenzioni o porto abusivo di armi proprie. Anche in questo caso ci vuole un minimo di senso civico. Mai, assolutamente mai dichiararsi obiettori di coscienza se invece non lo siete: la legge 772 che regola l’intera faccenda nonostante modificazioni e decreti presidenziali successivi è nata nel 1972 e secondo molti è ormai inadeguata a gestire il fenomeno dell’obiezione. Il che vuol dire che lo scorso anno sono state respinte oltre 10.000 richieste di servizio civile alternativo. Domanda: quanti tra i 27.393 che ce l’hanno fatta sono veramente antimilitaristi e quanti tra i 10.000 scartati sono semplicemente degli sfortunati?  [family magazine]

3) costruito un parcheggio da cinquecento posti e lo gestisce in proprio, facendo pagare il “posto macchina”. Secondo l’associazione costruttori edili di Torino, questa soluzione (tutti i costi della costruzione del parcheggio all’impresa, ma anche tutti gli utili) potrebbe essere vantaggiosa per tutti oltreché molto rapida. INDUSTRIA MILITARE I verdi chiedono di ridurre le spese militari, di trasformare il servizio militare in servizio civile, orientato verso il lavoro di risanamento ambientale, di riconvertire la produzione bellica verso usi civili. Le proposte considerate in qualche maniera realizzabili riguardano forme di riduzione della leva militare, mentre nessuno pensa veramente che in tempi brevi si potranno riconvertire le ricchissime industrie delle armi. Qui la battaglia si giocherà soprattutto sulla “sensibilizzazione dell’opinione pubblica”, in specie quella cattolica.  [family magazine]

4) La minaccia tossica si estende ad altri tre comuni. La nube corre dal Nord al Sud dell’agglomerato portuale. L’allarme si fa generale. Una cellula di crisi entra in funzione alla prefettura dove viene collocato lo stato maggiore dell’apparato della sicurezza civile. Nel primo pomeriggio un migliaio di uomini delle forze dell’ordine e dell’esercito, elicotteri ed équipe di tecnici e chimici del servizio civile fatti affluire in gran fretta da ogni parte della Francia sembrano avere in mano la situazione. Due squadroni di gendarmi hanno l’incarico difficilissimo di incanalare e mantenere l’ordine nella folla dei fuggiaschi. Cinquantamila persone fatte evacuare, che debbono abbandonare in qualche minuto le loro abitazioni, migliaia di bambini strappati dalle scuole.  [daily newspaper]

5) .... di Vichy ottennero soltanto che i deportati fossero 20.000, e non 50.000, e che venissero rinchiusi nel campo di concentramento di Frejus, sotto la loro sorveglianza, anziché in quello di Compiègne, controllato dalla Gestapo. Né meno accanita e feroce fu la caccia all’uomo nell’intera Francia per il reclutamento forzato della manodopera. Benché le terribili condizioni di vita spingessero più d’uno ad arruolarsi nel servizio civile per i tedeschi e ad accettare il trasferimento in Germania (a Parigi, nell’inverno ‘43- ‘44, i cittadini ricevevano in media 200 grammi di grassi e 300 grammi di carne al mese; il costo della vita era salito del 1660 per cento e le razioni giornaliere erano scese a 850 calorie), mancava sempre, al Gauleiter Sauckel, lo “zar della manovalanza schiava”, almeno mezzo milione di operai francesi….  [semi-specialized periodical].

It is to be noted that in activities such as this one, where the focus is on particular language features, the actual amount of text that the students are asked to translate may be much shorter than the stretch of text they are asked to read. The amount of context included should be sufficient to help students make decisions about how to translate the shorter stretch of text [8].

4. Concluding remarks

In this paper I have tried to show that a source language corpus can have a role to play in translator training. Its principal function is that of constituting a potential source of teaching and testing materials. In what exact way it may be used will depend on a variety of factors, notably the teacher’s preferred methodology, the relationship between a given language pair, the composition of the corpus and type of annotation, and the nature of the interrogation software available.

I have suggested that at times an SL corpus may actually be preferable to a bilingual corpus, and that at other times it may not be of great significance whether a bilingual corpus is available or not (especially where passages for testing are concerned). Also, where a bilingual corpus would have been preferable but is not available, an SL corpus can still sometimes have a useful role to play. This is by no means an unimportant point since, at the time of writing, monolingual corpora by far outnumber their parallel counterparts.


[1]  It could be argued that foreign language learning is a process distinct from that of learning to translate and that the former therefore has no place in a discussion of translator training. This would certainly be true if one considered only the case of perfect bilinguals. However, it is probably fair to say that in most, if not all, translation courses there is a symbiosis between translation and L2 teaching and acquisition: translation leads to better knowledge of the L2 and of its relationship with the mother tongue, and this knowledge leads to better translating. For this reason it would seem reasonable to discuss foreign language learning in the present context.

[2] Some examples of how contrastive rhetoric may be incorporated into a translator training course are to be found in Colina (1997).

[3] Some discussion of wordplay within the context of translator training is to be found in Ballard (1996).

[4] See Kussmaul (1995: 39-53) for some discussion relating to the role of creativity in translator training.

[5] The author would like to thank Prof. Antonio Zampolli of the Institute of Computational Linguistics in Pisa for permission to consult the Italian corpus mentioned in this study.

[6] Precision and recall are explained by Jeremy Clear (1993: 275) in the following way:  “Precision is the measure of success of the system in retrieving only the items of interest to the user….. Recall is a measure of the system’s ability to find all the interesting items out of the information base ......”.

[7] This item is, potentially, a false friend, but it should not prove to be one for the students whom the exercise is aimed at.

[8] Some of the organizational and didactic features of this exercise are also advocated by Duff (1989: passim), notably the importance of group discussion, the use of a number of short texts containing the same feature, and the actual translation of only one small part of the text.

About the author(s)

Stephen Coffey is a Researcher in English Language and Translation at the Faculty of Political Science and the Department of English Studies of the Università di Pisa. His research mainly concerns phraseology, lexicography, the methodology of corpus linguistics, language teaching, contrastive studies and translation.

