Corpora and the Innocent Translator

How can they help him? [1]

By Stella E. O. Tagnin (Universidade de São Paulo, Brazil)

Questo articolo descrive un esperimento condotto all’università di São Paulo in Brasile in cui studenti di traduzione sono stati istruiti a creare corpora di linguaggi specifici, ricavandoli da testi a stampa o dalla rete Internet. Successivamente questi corpora sono stati utilizzati per ricavarne dei glossari di collocazioni. L’esperimento si basa sul presupposto che le collocazioni (definite in termini sintattici piuttosto che funzionali) sono uno degli aspetti più difficili di una lingua. L’uso dei corpora può aiutare gli studenti di traduzione a migliorare le proprie abilità e impedire che si comportino come “traduttori innocenti”.


This paper describes an experiment conducted at the University of São Paulo in Brazil with a class of translation students, who were guided to create domain-specific corpora of texts (scanned or typed in from printed texts and/or downloaded from the Internet) and to extract collocational glossaries from them. The experiment is based on the assumption that collocation (defined in syntactic rather than functional terms) is one of the most difficult aspects of language. Learning to use corpus resources may help trainee translators to improve their skills and prevent them from acting as “innocent translators”.

Keywords: translator training, corpus linguistics, corpus-based translation studies, collocation, translator and interpreter training

©inTRAlinea & Stella E. O. Tagnin (2002).
"Corpora and the Innocent Translator"
inTRAlinea Special Issue: CULT2K
Edited by: Silvia Bernardini & Federico Zanettin
This article can be freely reproduced under Creative Commons License.
Permanent URL:

1. Introduction

Bowker (1998) has pointed out that the use of specialized native-language corpora was influential in improving the quality of a translation in terms of “correct term choice, and idiomatic expression” (1998: 648). This paper reports on an experiment that was initially intended to investigate whether her findings could be extended to general language as well.  Midway, however, it gained momentum and ended up as a corpus and glossary-building project, in a way similar to Maia (2000).

If we understand Bowker’s “correct term choice” as collocation and “idiomatic expression” as native-like production or naturalness, we will realize that both aspects are part and parcel of what is known as “conventionality in language”, lack of knowledge of which characterizes an “innocent speaker” (Fillmore 1979).  It will be seen that, given similar circumstances, a translator can be equally “innocent.”

2. The innocent speaker

Fillmore coined the term in relation to the foreign language learner. For him, an innocent speaker is a “compositional” speaker, that is, a literal speaker who does not know the conventions of a language. Innocent speakers do not know, for example, that prisoner and jailer have different meanings. Why should they? After all, both words are formed from a base (prison and jail, both meaning “a building where wrong-doers are locked up”) plus the agentive suffix -er. So why is a prisoner “a person kept in a prison”, while a jailer is “a man in charge of a jail”?

Also, an innocent speaker does not know, among other things, the preferred order of binomials like cats and dogs, bed and breakfast, knife and fork.  He or she does not know, either, that there are certain fixed or semi-fixed combinations of noun plus noun (credit card, quality control, cost of living), noun plus adjective (nursing home, silent movie, elementary school), noun, as subject, plus verb (a river flows, a volcano erupts) or verb plus noun as object (pay a visit, ask a question, make a decision), verb plus adverb (pay dearly, cry loudly, hurt badly), and adjective plus adverb (deeply hurt, happily married, lavishly illustrated). Nor does he or she know the formulas of the language, mainly routine formulas (Good evening, Have a nice day, I’m really sorry) and situational formulas (Break a leg, It takes one to know one, Have it your way).

In short, an innocent speaker is not aware of the fact that a very large part of language is made up of prefabricated chunks, of ready-made expressions, phraseological units which do not have to be generated every time they are used.

Depending on the situation, however, anyone can be an innocent speaker in his or her own language:  how would a lay person, for example, know the technical terms (mostly collocations) of certain professions (medicine, law) or know what to say (use the correct formulas) in unknown situations (like a funeral, if one has never been to one)?

But it is when contrasting two languages that these conventions will stand out more strikingly.  And that is when the translator comes into the picture.

3. The innocent translator

Basically, a translator’s “innocence” amounts to a compositional understanding of meaning and a lack of awareness of the extent to which language is made up of such prefabricated chunks.

The translator’s innocence can transpire both in his comprehension and in his production abilities. In terms of comprehension, he may not be able to understand idiomatic expressions like a hard nut to crack, put one’s best foot forward, or cut corners, for they are non-compositional: the total meaning of the expression is not the sum of the individual meanings of its components. He may not understand many discursive formulas as he may not know the social conventions that determine their use in the target language. And he may not understand humorous remarks which result from manipulation of conventional categories of language. For example, he will not understand puns like “fish and chimps” (unless he knows the binomial fish and chips), or “Ear today, gone tomorrow” (after Here today, gone tomorrow) in an article on the boxing match in which Mike Tyson bit off a piece of his opponent’s ear.

Strange as it may seem, even as a native speaker of the target language, he may have trouble achieving native-like renderings at the production level. He may stick so closely to the source text that he is not aware that, among equally grammatical forms, there is one preferred option. In other words, he may not realize that from an array of grammatically possible forms there are certain forms which have a higher probability of occurring. If a translator selects one of these possible forms to the detriment of the most probable one, he will produce a non-native-like translation, a translation which lacks naturalness. This problem is undoubtedly magnified if he is translating out of his native language.

In that respect, collocations and formulas rank highest in terms of difficulty. In the case of collocations, the difficulty may be due to the fact that they generally do not constitute comprehension problems so that they tend to go mostly unnoticed. That is to say, being mostly compositional, collocations are easily understood. However, when it comes to producing them, they are not as easily retrieved from memory.

There are various types of lexical collocations. Following Hausmann (1985) and using a syntactic rather than functional terminology, I will regard collocations as formed of a base, usually a noun, plus a collocate.  The collocation will derive its name from the collocate. So, a “verb + noun” collocation will be a verbal collocation, an “adjective + noun” will be an adjectival collocation, and so on.

Nominal and adjectival collocations certainly make up the bulk of the collocational inventory. There are myriads of them and more come up every day, as they are used to name new technologies, processes, theories (e.g. computer aided design, computer graphics, computer assisted language learning, corpus linguistics, translation studies, data storage), and new objects and products (mouse pad, video game, food processor, video camera, London Eye, RealPlayer, RealJukebox). Only rather specialized dictionaries will list such new occurrences.

Verbal collocations are fewer in number and are hardly ever found in general dictionaries. Even worse is the fact that when they are listed at all, they are usually listed under the verb, which is exactly the “unknown”. In Portuguese, for example, we say marcar uma consulta (“make a doctor’s appointment”) or marcar um encontro (“make an appointment” with someone). But we also say marcar uma reunião, which in English is “call a meeting”. At conferences, in Portuguese, we can fazer uma comunicação or apresentar um trabalho, whereas in English one of the options is “to give a paper” (“dar um trabalho” is unacceptable in Portuguese!).

4. Collocations in dictionaries and corpora

How can the translator go about finding an adequate translation for collocations?
There are a number of reference sources, such as mono- and bilingual dictionaries of idiomatic expressions (Boatner & Gates 1975, Spears 1988, Spears 1989, among others, for English; Serpa 1982, Camargo & Steinberg 1989, 1990 for the pair English-Portuguese) and a handful of dictionaries of formulas (Partridge 1977, Spears et al. 1995, Spears 1996) and of collocations (mainly Cowie et al. 1983, Benson et al. 1986, Hill & Lewis 1997).

Most languages, however, lack this kind of reference source for collocations: to my knowledge, there are only a Japanese (Akimoto et al. 1993) and a Chinese (Longman 1995) version of The BBI Dictionary of English Word Combinations (Benson et al. 1986). The picture becomes even gloomier in the case of bilingual dictionaries, a notable exception being the Russian-English Dictionary of Verbal Collocations (Benson & Benson 1993). A similar dictionary of verbal collocations for Brazilian Portuguese and English, in both directions, is in the making at the University of São Paulo (Tagnin 2000).

However, it would be a very “innocent” idea, indeed, to believe that a dictionary could solve all the translator’s problems with regard to the conventionality of language use.
Even the few available dictionaries will only provide a restricted list of occurrences. As a test case, I looked up the word “computer” in three different dictionaries and in two corpus resources.

The BBI Dictionary of English Word Combinations (Benson et al. 1986), under “computer” lists the following nominal and adjectival collocations:

6. an analog; desktop; digital; electronic; general-purpose; handheld; home; laptop; mainframe ~; [...] parallel; personal; serial ~ (p. 72). 

The LTP Dictionary of Selected Collocations (Hill & Lewis 1997) lists only:

home, laptop, mainframe, palmtop, personal ~ (p. 51). 

The Longman Dictionary of English Language and Culture (1993 edition) has the following:

computer-aided design, computer dating agency, computer game, computer graphics, computer hacker, computer modelling, computer programmer, computer science and computer virus. 

One corpus-related resource we used was a ready-made tool based on The Bank of English, Cobuild’s English Collocations on CD-ROM, featuring 10,000 headwords with up to 20 collocates for each one. A search for “computer” yielded the following results:

I looked into the examples for each collocate and came up with some more collocations: computer hardware, computer manufacturers, computer-products company, computer marketing research company, computer services company, computer software company, computer-security industry, computer video games, computer systems, computer workstations, computer-driven programs, computer-reservation, computer store, computer service business, computer-based information system, computer databases, computer information system, computer information network, computer-based graphics package, computer-based system and computer-based service.

Next I resorted to WebCorp, an online search tool which uses the Web as a corpus, producing concordance lines based on results from the search engines Altavista, Yahoo and Metacrawler. A search for “computer” produced 134 concordances from 60 web sites found by Altavista. The most frequent collocations were: computer systems (9), host computer (8), computer service (7), digital computer (4), electronic computer (3), computer hardware (3), computer store (3), and computer keyboard, computer design, computer center, computer field, computer products, computer dealers and computer software (2 occurrences for each). Of all of these collocations (including single occurrences as well) the only ones mentioned in the dictionaries we consulted were:


digital computer
electronic computer
general-purpose computer
mainframe computer
computer game
computer programmer
computer science

It is worth noting that by the time certain collocations are registered in dictionaries they may be on their way out of fashion, as is the case with personal computer, which has been mostly replaced by PC, or desktop computer, which has become simply desktop (plural desktops). No occurrences were found for these two collocations among the 134 concordance lines generated by WebCorp.
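The kind of tally reported for the WebCorp search can be approximated on any small collection of texts. The following is only a minimal sketch in Python, not the procedure WebCorp or the Cobuild CD-ROM actually uses: the function names and the sample sentences are invented for illustration, and the count simply covers every two-word sequence in which the node word appears, so combinations with function words (e.g. “and computer”) are counted too and would have to be discarded by hand.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (hyphens kept inside words)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def adjacent_collocations(texts, node):
    """Count two-word sequences in which `node` is the first or the second word."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for left, right in zip(tokens, tokens[1:]):
            if node in (left, right):
                counts[f"{left} {right}"] += 1
    return counts


# Invented sample sentences standing in for downloaded concordance lines.
sample = [
    "The host computer controls all computer systems in the lab.",
    "Our computer systems rely on a host computer.",
    "A digital computer needs computer hardware and computer software.",
]

counts = adjacent_collocations(sample, "computer")
for pair, freq in counts.most_common():
    if freq > 1:
        print(pair, freq)
```

On a real corpus the candidates printed this way would still need to be checked one by one, since frequency alone does not tell a collocation from a chance sequence.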

5. The corpus-building experiment

From the above, it seems clear that corpora are a fundamental resource for ensuring a natural translation. For this reason, after giving my students an introduction to the phraseological component of language and the problem it poses for translation, I decided to engage them in building a small corpus from which they were to extract all possible phraseological units (collocations, binomials etc.) and present them, as a term paper, by categories.

The 48 students were divided into 11 groups, each of which, for practical purposes, chose a field to research.  The fields ranged from fairly general ones like “fashion”, “cooking”, “advertising”, “beauty” and “public health” to highly specialized fields like “biotechnology”, “finance”, “telecommunications”, “accounting”, “computer science” and “law”.  Within each field, they chose a more specific topic for which they were asked to produce a corpus.

Having started collecting the texts, they soon realized they had to further narrow down their area of research, otherwise they would not have been able to handle the wealth of material they had assembled in their initial enthusiasm.

After this had been accomplished, each group chose a short text to work on.  As a first step they were asked to identify all collocational occurrences in the text and try to translate them.  Every week a different group presented their results to the class, first to discuss whether the units they had identified were actually phraseological, second to ensure they had attained a reliable translation.  By “reliable” was meant “conventional” (or “idiomatic”, in Bowker’s 1998 terms), an acceptable translation in the sense that it was the actual combination in use in that field of research.  In other words, if the translated term had only been found in a dictionary, it had to be further documented in an authentic context.
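Documenting a candidate term “in an authentic context” amounts to finding it in running text and looking at what surrounds it, which is what a concordancer does. The following Python sketch shows the idea in its simplest keyword-in-context form; the function name `kwic`, the context width and the sample sentences are assumptions made for illustration, not part of the tools the students used.

```python
import re


def kwic(text, phrase, width=30):
    """Return one keyword-in-context line per occurrence of `phrase` in `text`."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the search phrase lines up in a column.
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines


# Invented sample text standing in for a corpus file.
sample_text = """The committee will call a meeting next week.
Managers usually call a meeting whenever sales drop."""

concordance = kwic(sample_text, "call a meeting")
for line in concordance:
    print(line)
```

A term that returns no lines in the corpus has not been documented, however plausible its dictionary entry may look.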

This is when the corpus came into the picture.  Each group was asked to compile a corpus of approximately 200,000 running words, 100,000 in each language.  During the process, this proved far too ambitious though a couple of groups came quite close to that number. The texts were to be original texts or translations in English and Brazilian Portuguese and each text was to be identified as to source, language and original vs. translated.
Next, students were asked to enlarge their list of all types of collocations by searching their corpus. Here is a breakdown of their findings.

• Nominal and adjectival collocations. It turned out that nominal and adjectival collocations were the best represented categories, certainly because they are used to “name things”, as has been mentioned.  For this reason, they made up the bulk of the phraseological units, mainly in the areas of telecommunications, fashion and accounting.

• Verbal collocations.  These were more prominent in areas in which processes were discussed, like cooking, computer science, insurance contracts (law) and food engineering (biotechnology).

• Binomials. Most groups reported just a few occurrences, one or two only. However, they were more common in cooking (2 in Portuguese and 3 in English), fashion (2-5), finance (4-4) and especially accounting (11-10) texts.

Though the focus of the course was not on technical language per se, it turned out that most of these phraseological units consisted of technical terms within the area investigated. This led to my suggestion of organizing them into a glossary, which the students readily accepted. Thus, besides building a comparable bilingual corpus, each group presented a glossary ranging between 50 and 200 terms in each language. The “advertising” group, however, went far over that mark, compiling 327 terms in English and 244 in Portuguese. The glossaries presented the equivalent terms with authentic examples in both languages. No definitions were required.

6. Overall evaluation

Because the project gained an originally unforeseen extension, I will discuss each task separately.

Collection of phraseological equivalents
With reference to the original aim of the project, there was a consensus among all groups that the experiment was extremely valid in raising their awareness of an aspect of language heretofore unknown to them: meaning is not always compositional; very often words derive their meaning from “the company they keep”, as Firth has put it, that is, from the words they co-occur with. More specifically,
1. they became aware of the pervasiveness of phraseology in language;
2. they learned to identify phraseological units, mainly by their recurrent co-occurrences;
3. they understood that source text phraseological units could, and whenever possible should, be translated by a target language phraseological unit to ensure naturalness;
4. they realized most dictionaries were a poor source for phraseological equivalents;
5. they found that even a small corpus of well chosen texts could be highly useful in furnishing natural equivalents.
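Identifying phraseological units “by their recurrent co-occurrences” can be given a rough mechanical counterpart. The Python sketch below is an illustration under stated assumptions, not the method followed in class: it counts adjacent word pairs across a set of texts, skips pairs containing a word from a deliberately tiny, hypothetical stopword list, and keeps only pairs that recur at least `min_count` times. It ignores sentence boundaries and, being limited to pairs, misses longer units such as cost of living.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "for", "with", "on"}


def recurrent_bigrams(texts, min_count=2):
    """Return adjacent word pairs (without stopwords) that occur at least `min_count` times."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for first, second in zip(tokens, tokens[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                counts[(first, second)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Invented sample texts standing in for a small corpus.
sample_texts = [
    "Quality control is essential. Poor quality control raises the cost of living.",
    "The cost of living depends on quality control.",
]

candidates = recurrent_bigrams(sample_texts)
print(candidates)
```

The output is a list of candidates only; deciding whether a recurrent pair is actually phraseological remains a matter for the translator's judgment.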

Corpus-building skills
Because the original assignment was enlarged and adjusted midway, students sometimes felt at a loss as to what was expected of them.  In fact, most of them had to learn the corpus-building techniques the hard way, in their own time, with their own equipment and resources.  However, in the process
1. they learned they had to narrow down their topic of research and be more judicious in collecting their texts.  Most of them had started out collecting their material without any criteria only to realize a large part would be unsuitable for their purposes.
2. they realized that “traditional” texts, if they were to be included in the corpus, had to be either typed in or scanned.  Typing was highly time-consuming and scanning posed technical problems in terms of equipment (hardly anyone had access to a scanner) and in turning the resulting scanned “image” into a machine-readable text. The insurance group, however, due to the specificity of their topic, had to rely solely on traditional texts, which had to be typed in one by one.
3. they became aware of the richness of the Web as a source of electronic texts, but also realized texts were much more abundant in English than in Portuguese, which, in some instances, forced them to resort to “traditional” texts after all.
4. they realized early on that computer skills were an absolute must. Besides learning to surf the Internet and mastering Word and Excel, they were also required to be able to use specific search tools like the Simple Concordance Program (which they claimed was far too slow), or IntraText, a free service by which you submit a text via e-mail and get it back within a few minutes with all sorts of lexical information, including collocations. The problem it poses, however, is that all files are zipped, requiring specific software to unzip them, which not all students owned.
5. they learned how to run online searches on the WebCorp and Cobuild sites, either to confirm certain collocations or to find new ones.
6. finally, they learned they had to identify their texts and organize them in a hierarchical structure to be able to retrieve them according to their needs. The structure was provided by the teacher.
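The identification of each text (source, language, original vs. translated) and the hierarchical organization mentioned in the last point can be pictured as follows. This Python sketch is purely hypothetical: the actual structure provided in class is not described here, so the record fields, the folder layout and the sample texts are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CorpusText:
    field: str        # research field, e.g. "cooking"
    language: str     # "en" or "pt"
    source: str       # where the text came from
    translated: bool  # True for a translation, False for an original
    content: str      # the text itself


def relative_path(text, name):
    """Build a hypothetical field/language/status/file path for a text."""
    status = "translated" if text.translated else "original"
    return f"{text.field}/{text.language}/{status}/{name}"


def running_words_by_language(texts):
    """Count whitespace-separated running words per language."""
    totals = defaultdict(int)
    for text in texts:
        totals[text.language] += len(text.content.split())
    return dict(totals)


# Invented sample records standing in for real corpus files.
corpus = [
    CorpusText("cooking", "en", "invented sample", False,
               "Beat the eggs and add a pinch of salt."),
    CorpusText("cooking", "pt", "invented sample", False,
               "Bata os ovos e junte uma pitada de sal."),
]

totals = running_words_by_language(corpus)
print(relative_path(corpus[0], "recipe01.txt"))
print(totals)
```

Keeping a running total per language makes it easy to see how far a group is from a target such as 100,000 words in each language.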

7. Conclusion

Despite the fact that students complained that the overall project was too complex to be satisfactorily completed within the short period of time allotted to it (one four-month semester broken up by an almost two-month interruption due to a faculty strike), they agreed that it was a valuable experience for them (I must confess it was for me too). But, above all, they were aware that their newly acquired search and corpus-building skills were there to stay and would be instrumental in preventing them from acting as “innocent translators.”


[1] An earlier version of this paper was presented at The Lodz Session of the 3rd International Maastricht-Lodz Duo Colloquium on “Translation and Meaning”, Lodz (Poland), 22-24 September 2000.

