Issues in Computer-Assisted Literary Translation Studies

By Federico Zanettin (Università di Perugia, Italy)

Abstract & Keywords

In this article I first examine the application of computers to the study of literary texts and then provide an overview of how corpus linguistics tools and resources have been applied to the analysis of literary translation and translators. I discuss some of the issues involved in corpus design and construction, and consider some areas which may benefit from a corpus-based approach, including literary translation criticism and the study of translator style, one area that has received more attention than others. Finally, I suggest that quantitative and qualitative perspectives on computerized analysis of literary translation should converge and be integrated. 

Keywords: corpus-based translation studies, parallel corpora, stylometrics, corpus linguistics, literary translation

©inTRAlinea & Federico Zanettin (2017).
"Issues in Computer-Assisted Literary Translation Studies"
inTRAlinea Special Issue: Corpora and Literary Translation
Edited by: Titika Dimitroulia and Dionysis Goutsos
This article can be freely reproduced under Creative Commons License.
Stable URL:

1. Computer-assisted literary studies

Computer-assisted studies of literature started in the 1960s, taking advantage of the capability of computers to identify strings and patterns in electronic texts. With the help of computers, literary scholars could quickly access a list of all words in a text, and for each word access all the contexts in which that word appeared. While the use of electronic media as a means of scholarly analysis is still somewhat controversial among literary critics (Rommel 2004, Cohen 2010), digital humanities have been developing steadily over the years. Perhaps two main research areas can be discerned: the first revolves around the creation of electronic scholarly editions, thematic research collections and archives. Electronic versions of literary works are resources that facilitate access to texts by allowing for full text searches. Furthermore, the integration of metadata and multimedia in a dynamic hypertextual environment grants the researcher access to information about the text, for instance transcriptions and annotations from historical and scholarly editions, the reproduction of original sources such as manuscripts, and so on (Price 2008). A second area of investigation focuses on textual analysis within the theoretical framework of literary-linguistic stylistics: “(i)n this context, texts are seen as aesthetic constructs that achieve a certain effect (on the reader) by stylistic features on the surface structure of the literary text” (Rommel 2004: 89).
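
The two basic operations described above, listing all the words in a text and retrieving every context in which a word appears, can be illustrated with a minimal sketch. The function names and the sample sentence are invented for illustration and do not reproduce any particular concordancing program.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def word_list(tokens):
    """Return the distinct word forms in alphabetical order."""
    return sorted(set(tokens))

def concordance(tokens, keyword, width=3):
    """Return every occurrence of keyword with `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Made-up sample sentence, used only to show the output format.
sample = "The waves broke on the shore. The waves fell and the sun rose."
tokens = tokenize(sample)
hits = concordance(tokens, "waves", width=2)
for left, key, right in hits:
    print(f"{left:>12} [{key}] {right}")
```

Each hit is a (left context, keyword, right context) triple, which is the keyword-in-context layout that concordancers typically display.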

Approaches to computer-assisted stylistics vary: on the one hand, access to the text in electronic format is seen as a help to navigate and retrieve passages and citations in context in order, for instance, to explore imagery and characterization in a novel. The use of text analysis software is here functional to ‘traditional’, qualitative textual analysis, supplementing, corroborating and perhaps even challenging critical and interpretative reading of the whole text. On the other hand, more quantitatively-oriented approaches involve the use of statistical methods of analysis, based on frequency counts of any textual feature that can reliably be identified. Such studies are generally known as stylometry and have been successfully used in authorship attribution, both for legal purposes (forensic linguistics) and in literary-linguistic studies. They are based on the idea that each text bears the hidden signature of its author, the trace left by unconscious linguistic habits which is recoverable by comparing patterns in the use of textual features. As Hoover (2008: online) explains in his overview of quantitative approaches to literary studies,

Words …, as the smallest clearly meaningful units, are the most frequently counted items, and syntactic categories (noun, verb, infinitive, superlative) are also often of interest, as are word n-grams (sequences) and collocations (words that occur near each other). Thematic or semantic categories (angry words, words related to time), while more difficult to count, have the advantage of being clearly relevant to interpretation, and automated semantic analysis may reduce the effort involved. Phrases, clauses, syntactic patterns, and sentences have often been counted, as have sequences or subcategories of them (prepositional phrases, subordinate clauses, passive sentences). Many of the items listed above are also used as measures of the lengths of other items: word length in characters, sentence or clause length in letters or words, text length in words, sentences, paragraphs, and so forth.

One of the most successful procedures in authorship attribution was put forward by John Burrows (2002a) and is known as Burrows’s Delta. This simple but usually accurate measure of textual similarity and difference is based on the relative frequencies of the 150 most common words, and “is designed to pick the likeliest author of a questioned text from among a relatively large number of possible authors” (Hoover 2008: online).[1] Two further measures, introduced at a later date (Burrows 2006), are called Zeta and Iota, and focus on middle and low frequency words. Burrows applied his methodology to a number of research areas including translation, as will be seen in section 2.
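
The logic of Delta can be sketched in a few lines. The version below follows the commonly described formulation (z-scores of the relative frequencies of the most frequent words, then the mean absolute difference between z-scores) and should be read as an approximation rather than as Burrows's own implementation. The function names, the use of the population standard deviation, the pooling of candidate texts to select the word list, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev

def rel_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]

def burrows_delta(candidates, questioned, n_words=150):
    """Return a Delta score for each candidate author (lower means more similar).

    candidates: dict mapping author name to a list of tokens
    questioned: list of tokens from the text of unknown authorship
    """
    # Assumption: the word list is the n most frequent words of the pooled candidates.
    pooled = Counter()
    for toks in candidates.values():
        pooled.update(toks)
    vocab = [w for w, _ in pooled.most_common(n_words)]

    profiles = {name: rel_freqs(toks, vocab) for name, toks in candidates.items()}
    columns = list(zip(*profiles.values()))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]  # population SD, an assumption

    def z_scores(profile):
        # Words with zero variance carry no information and are scored 0.
        return [(f - m) / s if s > 0 else 0.0 for f, m, s in zip(profile, mus, sds)]

    zq = z_scores(rel_freqs(questioned, vocab))
    return {
        name: mean(abs(a - b) for a, b in zip(z_scores(p), zq))
        for name, p in profiles.items()
    }

# Toy data, far too small for real attribution work.
candidates = {"A": "the the the of a".split(), "B": "of of of the a".split()}
questioned = "the the of a the".split()
deltas = burrows_delta(candidates, questioned)
likeliest = min(deltas, key=deltas.get)
```

The candidate with the smallest Delta is taken as the likeliest author; in practice each candidate would be represented by tens of thousands of words rather than a handful.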

Within translation studies, some research can be examined within the framework of literary translation criticism. As Paloposki (2012) explains, translation criticism can be understood as the (journalistic) practice of reviewing translations, as “related to prescriptivism or values and ideological choices” (Paloposki 2012: 185), or as a hermeneutic tool for the non-judgmental evaluation of literary translations. In this latter sense, “it explores the interpretative potential of a translation [looking] at degrees of similarity to or divergence from the source text’s perceived interpretative potential” (Hewson 2011: 6-7). Much research based on parallel corpora can indeed be seen in the light of such an approach. Some studies have focused on one parallel text, for instance Munday (1998), who uses a computerized version of a short story by Gabriel García Márquez and its English translation by Edith Grossman, and Mahlberg (2007), who looks at Gustav Meyrink's translation into German of Charles Dickens’ Bleak House. In most studies, however, one or more source texts are accompanied by multiple translations by different translators, in what Johansson (2003) later termed “star” model parallel corpora.

One of the first studies to introduce a computer-assisted study of literary translation was Maczewski (1996), who proposed the acronym CoALiTS (Computer Assisted Literary Translation Studies) to denote a field of research combining literary and linguistic computing with literary translation. He exemplified this approach through a case study, a computer-assisted analysis of the first chapter of Virginia Woolf's The Waves, together with two French and three German translations. He created a small multilingual parallel corpus, which was linguistically annotated for lemma and part-of-speech, aligned at phrase level and analyzed through the use of the PALIMPSEST suite of programs written by the author himself. The application provides for parallel browsing and concordancing, and for deriving statistics about word and phrase order, which were used to compare the various translations in relation to the source text. Thus, Maczewski found that, while all German translations closely followed the wording of the English source text, the two French translations differed more widely, one being “clearly target-oriented” and reading smoothly in French, the other giving priority “to the original's syntax and lexis” (Maczewski 1996: 183), and letting the source language “shine through” (ibid.: 184). Maczewski explained this partly as a consequence of closer structural similarity between English and German than between English and French, and partly as a result of different translation strategies.

Bosseaux (2001, 2004, 2007) investigated how translators manifest their ‘voice’, that is their discursive presence and point of view, in a number of French translations of two of Virginia Woolf’s novels, The Waves and To the Lighthouse. Using both monolingual and parallel concordances, Bosseaux considered features such as personal pronouns, time and space adverbials and verb tense in order to investigate deixis, modality, transitivity and free indirect discourse, and determine “whether and how the translator’s choices affect the transfer of narratological structures” (Bosseaux 2004: 108). Winters (2004, 2007, 2009) studied a number of style-related features in two different German translations (by Hans-Christian Oeser and Renate Orth-Guttmann) of F. Scott Fitzgerald’s The Beautiful and Damned. Her analysis focused on modal particles, foreign words, code-switches and speech-act reporting verbs, and showed how the two translators differ significantly in their use of these features.

As noted by Huang and Chu (2014: 127), studies based on corpora containing different translations of the same source text have the advantage of detecting the influence of source text features on target realizations, but at the same time they define the concept of style in translation as a measure of difference from the source text. Thus, although they

demonstrate that individual translators can adopt quite different approaches to the translation of the same source text, their results do not reveal whether the patterns they identify are indeed consistent stylistic traits in each translator's work, rather than reflecting personal and circumstantial interpretations of a specific text. (Saldanha 2011b: 33)

Dorothy Kenny (2001) used a corpus of English translations of “experimental” novels by German authors to study lexical creativity in translation. She used the German-English Parallel Corpus of Literary Texts (GEPCOLT) in conjunction with two general reference corpora, one for German and one for English, to investigate whether translators resorted to lexical creativity to the same extent as source text authors, or rather showed a tendency towards normalization (one of the hypothesized universal features of translation, see below). Kenny looked at how hapax legomena (words only used once), writer-specific forms, and unusual collocations in the source texts are translated. Her results suggested an overall tendency towards normalization as concerns single words, and a combination of linguistic creativity and normalization, depending on individual translators, as concerns the translation of creative collocational clusters. Kenny was primarily interested in normalization as a general feature of translated texts rather than in the study of translator style, but her results also point to patterns of consistent variation in the linguistic habits of individual translators.

While corpus-based analyses of source texts and translations can certainly be carried out using plain text corpora, the enrichment of such corpora with lexical, syntactic and semantic annotation can in many cases facilitate or enhance the analysis. Furthermore, as stressed by Rommel (2004: 92), “[t]he importance of markup for literary studies of electronic texts cannot be overestimated, because the ambiguity of meaning in literature requires at least some interpretative process by the critic even prior to the analysis proper”. While much linguistic annotation can to a large extent be implemented automatically through existing NLP resources (Zanettin 2012), information regarding other textual features relevant to the analysis of literary translations can only be added through some degree of manual intervention. One interesting case, for instance, concerns the annotation of proper names. Van Dalen-Oskam (2013) tagged names in 10 English novels and their Dutch translations, and in 10 Dutch novels and their English translations, arguing that, while onomastic research in literary stylistics has mostly been qualitative, focusing on the meaning of particularly significant names, computer-aided quantitative onomastics may also provide interesting results. She found, for instance, that English translators from Dutch tend to add more name mentions (with respect to source texts) than Dutch translators when translating from English. Van Dalen-Oskam’s (2013) study highlights one important aspect of annotation: after trying to apply existing named entity recognition and classification (NERC) tools to speed up the tagging of names, she discovered that they would not work properly with literary texts, and eventually had to resort to manual tagging. In order to be reliable, manual name tagging had to be carried out by Dutch and American scholars each working interpretatively with texts in their own language/culture, which required, in turn, an assessment of inter-coder agreement.

One type of annotation that is specific to parallel corpora is alignment annotation. In fact, while it is possible to compare features of source and target texts independently, for example, the frequency of specific words or the respective standardized type/token ratio (STTR), parallel corpora are best exploited when the researcher can access parallel segments using appropriate visualization tools. This is especially the case for qualitatively oriented research which privileges the interpretation of individual instances, whose retrieval and analysis are facilitated by browsing and concordancing aligned parallel corpora. Various systems for the automatic alignment of parallel texts are now available, which reach a very high accuracy rate in mapping correspondences between “equivalent” segments and creating bitextual “translation units” (Zanettin 2012). However, literary texts seem to be especially “resistant” to alignment (Mikhailov 2002), making this a very time-consuming enterprise, which also explains why parallel corpora of literary texts are a scarce commodity. The type of difficulties involved in breaking up a text pair to obtain parallel segments can be appreciated by considering a paragraph from Salman Rushdie’s novel Midnight’s Children and its Italian translation (I figli della mezzanotte) by Ettore Capriolo (Zanettin 2001).

'A jewel,' he said, honking into a handkerchief, 'Sir and Madam, your daughter is a jewel. I am humbled, absolutely. Darn humbled. She has proved to me that a golden voice is preferable even to golden teeth.’

"Un gioiello," disse, strombettando nel suo fazzoletto. "Signore, signora, vostra figlia è un gioiello. Sono mortificato. Assolutamente mortificato. Mi ha dimostrato che una voce d'oro è preferibile persino ai denti d'oro."

Both an automatic aligning system and a human researcher would probably find it difficult to link together sentence/segment pairs from the above paragraphs. Based on punctuation, in fact, an automatic system may find four “sentences” in the English paragraph and five in the Italian one, corresponding to the number of sentence-final full stops. The difference in segmentation between Italian and English can be mostly traced to different conventions in the rendering of direct speech: whereas in the source text the narrator’s comment is embedded in the first sentence of the dialogue, in the translation it splits that statement into two sentences. Additionally, the second and third sentences in the source text (corresponding to the third and fourth sentences in the translation) are very short, and they are closely interrelated through the repetition of the word “humbled”. Thus, they should perhaps best be interpreted as a single bitextual segment, which includes a full stop. This shows that the optimal size and boundaries of alignment units (i.e. bitextual segments) in parallel corpora of literary texts cannot be decided in advance on the basis of formal features, and that in order to fine-tune the alignment for subsequent analysis of parallel concordances a great deal of manual editing may be involved. As a side benefit of this, however, manual alignment editing can also be seen as an essential analytical stage that helps the researcher to spot and categorize features worth investigating (in this case, for instance, patterns of repetition and parallel structures).
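
The mismatch can be reproduced with a deliberately naive, punctuation-based segmenter applied to the two paragraphs quoted above. The splitting rule (a full stop, question mark or exclamation mark, optionally followed by a closing quotation mark, then whitespace or the end of the paragraph) is an assumption chosen for illustration and is not the rule used by any specific alignment system.

```python
import re

# Closing quotation marks that may directly follow sentence-final punctuation.
CLOSERS = "\"'’”"
SENTENCE = re.compile(r".+?[.!?][" + re.escape(CLOSERS) + r"]?(?=\s|$)")

def naive_sentences(paragraph):
    """Split a paragraph at sentence-final punctuation, ignoring discourse structure."""
    return [m.group().strip() for m in SENTENCE.finditer(paragraph)]

english = ("'A jewel,' he said, honking into a handkerchief, 'Sir and Madam, "
           "your daughter is a jewel. I am humbled, absolutely. Darn humbled. "
           "She has proved to me that a golden voice is preferable even to golden teeth.’")

italian = ('''"Un gioiello," disse, strombettando nel suo fazzoletto. '''
           '''"Signore, signora, vostra figlia è un gioiello. Sono mortificato. '''
           '''Assolutamente mortificato. Mi ha dimostrato che una voce d'oro '''
           '''è preferibile persino ai denti d'oro."''')

en_sents = naive_sentences(english)
it_sents = naive_sentences(italian)
print(len(en_sents), len(it_sents))  # 4 5
```

Because the two lists differ in length, a one-to-one pairing by position would misalign every segment after the first. An aligner has to allow one-to-many links (here, the first English sentence with the first two Italian ones), or the boundaries have to be edited by hand.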

2. Corpus-based literary translation studies

Studies in corpus-based literary translation include translation criticism, but also general investigations of literary translation norms, such as Øverås (1998); the assessment of how differences in language and style between two translations of the same texts produced at different times relate to language change, as in Ji’s (2012) analysis of two late 20th century Chinese translations of Cervantes’ Don Quijote; and the study of indirect translation envisioned by Zubillaga et al. (2015) when creating their trilingual parallel corpus of literary translations from German into Basque, with Spanish as a relay language. However, the majority of corpus-based studies on literary translation have perhaps been carried out in the context of research concerning “translator style”, defined as a coherent and motivated pattern of choices “recognizable across a range of translations by the same translator”, which “distinguish that translator’s work from that of other translators” and which “cannot be explained as directly reproducing the source text’s style or as the inevitable result of linguistic constraints” (Saldanha 2011a: 240). Different investigations have compared a range of features across various types of corpora. Quantitative procedures applied to the analysis of translator style refer to the frequencies and distribution of lexical or syntactic features, and are not dissimilar to those used in computational literary stylistics more generally, as outlined by Hoover (2008).

While most studies mentioned so far have centred their analysis on the comparison of parallel corpora of source texts and translations, Baker (2000) proposed that an approach based solely on a corpus of translations, such as the Translational English Corpus (TEC), could be used to investigate the style of literary translators. She suggested that by comparing the sets of texts produced by different literary translators working in the same language from different source languages it may be possible to identify distinctive linguistic habits, manifested as consistent patterns of stylistic variation. Baker shifted the focus of the analyses away from the congruence between the style of the translations and that of the source text, onto individual literary translators, to investigate whether they can be shown to use distinctive styles of their own. She argued that corpus-based research should “explore the possibility that a literary translator might consistently show a preference for using specific lexical items, syntactic patterns, cohesive devices, or even style of punctuation, where other options may be equally available in the language” (Baker 2000: 248). Style is here defined, not as a way of responding to the source text, but rather, much as in literary-linguistic stylistics and forensic linguistics, as a “thumb-print” resulting from conscious or unconscious linguistic choices. In a tentative experiment, Baker (2000) applied to two sets of works by Peter Bush and Peter Clark (translating from Portuguese and Spanish, and from Arabic, respectively) some basic statistical measures that had already been used in other studies as indicators of translation universals. These are standardized type/token ratio (STTR) and average sentence length (used as indicators of simplification, Laviosa 1997, 1998) and frequency of reporting structures introduced by the English verb say (Olohan and Baker 2000).
However, “although Baker finds interesting patterns in the work of both Peter Clark and Peter Bush, because her corpora do not include source texts it is difficult to prove conclusively that these patterns are not carried over from the source text” (Saldanha 2011b: 32). This point was later further investigated by Huang and Chu (2014), who applied Baker’s (2000) measures to a corpus of literary translations by two translators, both of whom translated from Chinese into English. They found little variation in the STTR and mean sentence length in the two sets of translations, concluding that these statistical measures do not seem to be reliable as indicators of translator style. Noting also that the distribution pattern of the reporting verb say in all its forms is similar for both translators and to the average for the TEC corpus, they suggest that the more marked differences in Baker’s study are probably due to the difference in source language, as Baker herself had suggested might be the case. Zanettin (2000) shows how the source language variable could be kept under control by relating measures of STTR for different translators to the average STTR in large reference corpora of translations, for each of the source languages involved.
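
For readers unfamiliar with these measures, the following sketch shows how a standardized type/token ratio and a mean sentence length can be computed. STTR averages the type/token ratio over consecutive chunks of equal size, so that texts of different lengths remain comparable. The chunk size of 1,000 tokens is a common default and is assumed here; the function names and the toy token list are illustrative only.

```python
def ttr(tokens):
    """Plain type/token ratio as a percentage."""
    return 100 * len(set(tokens)) / len(tokens)

def sttr(tokens, chunk_size=1000):
    """Mean type/token ratio (as a percentage) over consecutive full chunks.

    A trailing chunk shorter than chunk_size is discarded.
    """
    starts = range(0, len(tokens) - chunk_size + 1, chunk_size)
    chunks = [tokens[i:i + chunk_size] for i in starts]
    if not chunks:
        raise ValueError("text is shorter than one chunk")
    return sum(ttr(chunk) for chunk in chunks) / len(chunks)

def mean_sentence_length(sentences):
    """Average number of tokens per sentence; sentences is a list of token lists."""
    return sum(len(s) for s in sentences) / len(sentences)

# Toy example with a chunk size of 4 instead of 1,000.
tokens = "a b c d a b c d".split()
plain = ttr(tokens)            # 4 types / 8 tokens = 50.0
standardized = sttr(tokens, 4)  # each 4-token chunk has 4 types = 100.0
```

The toy example shows why standardization matters: the plain ratio falls as a text grows longer and words recur, whereas the chunk-based figure is independent of overall text length.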

Huang and Chu (2014) also compared results for STTR, average sentence length and frequency of reporting verbs in a corpus of several translations by different authors from different source languages[2] with those from a corpus of several works by a few English writers, and found that there was much less variation among translations than among works originally written in English. A similar trend was also noted by Rybicki (2006), who analyzed two different English translations of three novels by the Polish author Henryk Sienkiewicz. By looking at the correlations between the relative frequencies of the most frequent words in dialogues, Rybicki was able to discriminate the characters’ idiolects, that is, the words and patterns typically used by different characters. Multidimensional scaling plots for the translations proved to be very similar to those of the source texts, though at times the characters’ idiolects “are less different between translations than between either translation and the original” (Rybicki 2006: 102). While not useful towards the identification of traits of individual translator style, such results could be seen as pointing towards “leveling out” (Baker 1993) or “convergence” (Laviosa 1998) as a more general feature of (literary) translation.

Baker (2000) suggests that, while some of the differences between translators can be traced to the influence of source language features or to stylistic choices of the source text author, it may, nevertheless, be possible to identify target language patterns that are less likely to be affected by the source text. One such pattern may be the relative frequency of the optional that in English reporting structures, which can be seen as an indication of a smaller or larger degree of syntactic explicitation by different translators. In her corpus, she found a rather different distribution of zero/that realizations in the works of the two translators, apparently independent of the source languages. The same pattern is investigated by Saldanha (2008, 2011a, 2011b) who, however, in order to control the source language variable, compares works translated by two different translators (Margaret Jull Costa and Peter Bush) from the same source languages (Spanish and Portuguese), and combines Baker’s methodology with parallel corpus analysis. She finds that Jull Costa tends to use the optional that after reporting verbs with a frequency similar to that found in the reference corpus of translations (the TEC), while Bush’s usage exceeds the higher average frequency in a comparable corpus of English non-translated literature (extracted from the British National Corpus). Saldanha also looks at the use of typographical markers such as italics and quotation marks, to investigate whether consistent patterns of choice differentiate the works of the two translators, and if so whether these features can be shown to be due to the translators’ stylistic preferences or whether they are triggered by the source texts.
She finds that, once the italics carried over from the source texts are subtracted,[3] one translator (Jull Costa) makes comparatively greater use of italics for emphasis, thus favouring readability in her translations, while the other (Peter Bush) makes a comparatively greater use of italics to highlight source language words. Saldanha also finds that, whereas Jull Costa uses italics mostly for emphasis and often accompanies non-italicized foreign words with glosses, Bush consistently uses italicized foreign words more often than the average in a reference corpus of English translations from Portuguese (extracted from the COMPARA parallel corpus, Frankenberg-Garcia and Santos 2003). Saldanha (2011b) concludes that these patterns of variation may be regarded as individual stylistic traits, which point to different explicitation strategies used by the two translators.

Zanettin (2001) used a parallel English-Italian corpus of six translations by two different translators (Ettore Capriolo and Vincenzo Mantovani) and their source texts by the same author (Salman Rushdie) to analyze the frequencies of Italian translations of expressions of indeterminacy such as “a sort of”, “a kind of” and “something of a”, and found that most of the time the English expressions were translated by one of two Italian phrases, namely “una sorta di” and “una specie di”, with seemingly no preference for one or the other. However, and regardless of source text stimulus, it appeared that Capriolo has a strong preference for the phrase “una sorta di”, which he uses in about 70% of all cases, whereas Mantovani has a strong preference for “una specie di”, which he uses in almost 77% of all cases.

As opposed to studies based on parallel corpora of source and target texts, a number of experiments conducted within a target monolingual framework have applied Burrows’s Delta and other quantitative techniques to different analytical tasks. Burrows (2002b) himself tested his Delta methodology on a corpus of literary translations, by comparing a half-million-word corpus of English Restoration poetry with the English versions of Juvenal's tenth satire by fifteen different translators, four of whom were also authors of works in the corpus. He found that, while Delta is not very successful when used to attribute the translations to specific authors, it manages to associate a given author/translator with the right translation in three cases out of four. That is, while it was not able to identify Dryden among all others as the author of one of the translations, knowing that Dryden is the author of one of them, it was able, nevertheless, to rightly identify the one most likely to be his. In a study that relies on quite sizable corpora of translations, Rybicki (2012) used Burrows’s Delta to compare three sets of texts translated into and out of English, Polish and French. The corpus contained literary translations by different translators of multiple source texts by different authors. Cluster analysis by means of tree diagrams showed that works by the same author are grouped together, whereas no such pattern is visible for works by the same translator. Rybicki and Heydel (2013) applied Burrows’s Delta to a case of collaborative translation involving a Polish translation of Virginia Woolf’s Night and Day. The translation was begun by Anna Kołyszko, who died in 2009 leaving the manuscript unfinished. Magda Heydel, a translator and Woolf scholar, took over, editing the finished chapters and translating what was left.
Through this method they were able to clearly distinguish which chapters were translated by Kołyszko and which by Heydel, who was consulted to confirm chapter attribution. Conversely, Hung et al. (2010) were able to show, using principal component analysis and other statistical techniques, that some medieval Chinese translations of Indian Buddhist texts traditionally attributed to different translators were in fact translated by the same translator or group of translators.

Ji and Oakes (2012) used a range of statistical tests (Student’s t-test, chi-squared test, Pearson’s r, etc.) to look at stylistic variation in a corpus containing two late 19th century English translations (by Bowra and Joly) of the Chinese classic novel Hongloumeng by Cao Xueqin. They considered both syntax and lexis, and compared features such as word and sentence length, part-of-speech distribution, words expressing “emotion” and “value”, fixed phrases and n-grams, without taking into consideration the source text. Their results suggested that “when compared to Bowra’s earlier translation, Joly’s version … has deliberately intended to enhance the idiomaticity of the translated language through an idiosyncratic use of English idiomatic expressions and terms” (Ji and Oakes 2012: 205-206). Finally, Wang and Lee (2012) compared two different Chinese translations (by Xiao and Jin, respectively) of James Joyce’s Ulysses with each other and with the source text, and then compared one of the two translations with two subcorpora by the same translator/author, one of translations and the other of original writing. First, they compared the wordlists from the two translations, finding that one translator (Xiao) tends to use colloquial verbs and particles more frequently than the other. Furthermore, they observed that Xiao’s idiosyncratic use of lexicon in translation mirrors the same phenomenon in his original writing.
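
As an indication of how one of the tests mentioned in connection with Ji and Oakes (2012) works, the sketch below computes Pearson's chi-squared statistic for a contingency table of feature counts in two translations. The counts are hypothetical and are not taken from any of the studies discussed; the function name is likewise illustrative, and no continuity correction is applied.

```python
def chi_squared(table):
    """Pearson's chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the feature were distributed equally in both texts.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical counts: rows are two translators, columns are
# occurrences of two competing variants of the same feature.
counts = [[30, 70],
          [60, 40]]
statistic = chi_squared(counts)
print(round(statistic, 2))  # 18.18
```

For a two-by-two table there is one degree of freedom, and a statistic above 3.841 is conventionally taken as significant at the 0.05 level. Such a result would indicate only that the two distributions differ, not why they differ.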

While much corpus-based translation studies research has used literary narrative texts as primary data, quite often these studies were not concerned with literary properties, analysis or criticism. Rather, the corpora were used to explore the features of translation in a given language or of translated language as a variety in itself, as a result of the translation process and independently of variables such as source language, date and text type or genre. Thus, for example, while the corpora used in two of the first empirical studies of translations (Gellerstam 1986, Laviosa 1997) contained exclusively or almost exclusively literary fiction, the purpose of the investigation was linguistic rather than literary. Gellerstam compared a corpus of translated Swedish fiction with a corpus of fiction originally written in Swedish in order to see whether computer-aided analysis could uncover “translation fingerprints” (Gellerstam 2005: 202). Laviosa (1997, 1998), instead, first used the TEC as a test bed for investigating some hypothesized universals of translation, i.e., supposedly invariant features, which characterize all translated texts, irrespective of source text and language (Baker 1993). Results from these corpora are often presented as provisional evidence in (dis)favour of the “translation universals” hypothesis, needing to be corroborated by congruent data from larger, more varied, translational corpora in different languages.

3. Features of translator style

According to Hoover (2008: online), when investigating style,

[t]he need for some kind of comparative norm suggests that counting more than one text will often be required and the nature of the research will dictate the appropriate comparison text. In some cases, other texts by the same author will be selected, or contemporary authors, or a natural language corpus. In other cases, genres, periods, or parts of texts may be the appropriate focus. Counting may be limited to the dialogue or narration of a text, to one or more speakers or narrators, or to specific passages.

Studies of translator style have compared different translations of the same source text(s), translations and source texts by the same or different authors, translations by the same authors from the same or from different source languages, translations and non translations by the same writers, translations and reference corpora of non translations in the same language, and source texts and reference corpora in the source languages. In order to compare the work of individual translators with a norm representing a wider range of uses, one or more reference corpora have been used together with the main corpus, this latter containing either different sets of translations or translations and source texts. Corpora used for reference include corpora of non-translated texts in the source language/s, corpora of non-translated texts in the target language/s, and corpora of translated texts in the target language/s.

Huang and Chu (2014) propose a distinction between features that refer to the translator’s “conscious, purposeful and consistent” way of “transferring the ST features to the TT” (Huang and Chu 2014: 136), and features that focus “on the habitual linguistic behaviour of individual translators [and] results from linguistic patterns that are the translator’s subconscious choices” (ibid.). According to them, these two types of translator style, which they label “S-type” and “T-type”, should be investigated using different methodologies, which are then to be combined into an integrated model. Thus, “S-type” style should first be detected by comparing multiple translations of the same source text, and then the consistency of each translator’s style should be verified by using comparable corpora composed of all the translations by each translator.

However, not all combinations and comparisons are possible in all cases since, for example, not all texts are translated more than once, not all translators are also authors of original works in the same genre, nor do they all translate from more than one language and, more pragmatically, because not all types of corpus resources may be available for a given piece of research. Furthermore, while each of these types of comparison, usually in some kind of combination (e.g. one source text, multiple translations of it, a reference corpus of translations from the same language), adds a fragment to the general picture, on the one hand increasing the number of authors, languages and source texts makes it more difficult to control target language variables; on the other, increasing the number of translators, translations and non-translated texts in the target language makes it more difficult to control source language variables.

The linguistic patterns investigated rely on both quantitative and qualitative features, and, hence, motivated individual preferences are manifested both as quantitative, largely unconscious linguistic patterns, and as consistent and usually conscious translation choices. Measures of difference, based in some cases on simple word counts and in others on quite complex statistical methods, have taken into consideration the frequency of surface features such as orthographic words (as in STTR and Delta), or features derived from annotation regarding, for instance, parts of speech and words expressing “emotion” and “value” (Ji and Oakes 2012), or typographic features such as italics and quotation marks (Saldanha 2011a, 2011b). The analysis of these features is sometimes totally automated, and sometimes partially automated based on manual classification and selection of relevant instances, for example, by filtering out occurrences of reporting verbs which do not allow optional that (Olohan and Baker 2000), or classifying different functions of italics (Saldanha 2011a, 2011b).
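As an illustration of a measure based on simple word counts, the following is a minimal sketch of a standardized type/token ratio (STTR): the text is cut into consecutive chunks of equal length, the type/token ratio is computed for each chunk, and the ratios are averaged. The tokenizer, the function names and the default chunk size of 1,000 tokens are assumptions made for the example, not details taken from the studies cited above.

```python
import re


def tokenize(text):
    """Split a text into lowercase orthographic words (a deliberately crude tokenizer)."""
    return re.findall(r"\w+", text.lower())


def sttr(tokens, chunk_size=1000):
    """Return the mean type/token ratio, as a percentage, over consecutive full chunks.

    A trailing chunk shorter than chunk_size is ignored, so that every
    ratio is computed on the same number of tokens.
    """
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size * 100)
    if not ratios:
        raise ValueError("the text is shorter than one chunk")
    return sum(ratios) / len(ratios)


# Toy usage with a very small chunk size, for illustration only.
sample = tokenize("the cat sat on the mat the dog sat on the log")
print(round(sttr(sample, chunk_size=4), 2))
```

Averaging over fixed-size chunks is what makes the figure comparable across texts of different lengths, since a plain type/token ratio falls as a text gets longer.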

Summing up, the results of existing literature suggest that, while the style of a translated text is influenced to a large extent by that of the specific source text, author and language being translated, individual translators, nevertheless, do have their own distinctive style. In fact, some linguistic features have been shown to be independent of source text variables, and may be thus traced to individual preferences. It is possible to attribute “translatorship” of different parts of the same texts to two different translators, or to attribute one out of many translations of the same source text to a given translator, and even to identify patterns of linguistic behaviour which vary consistently across translators, while not being triggered by source text/language features.

While some measures of difference between texts produced by distinct translators seem to be, to a large extent, independent of the translation activity (i.e. they also characterize a translator’s non-translational writing activity), others can be, and have also been, interpreted as indicators of hypothesized universal features of translations. STTR, for instance, has been investigated as an indicator of lexical simplification (Laviosa 1998), the optional reporting that in English as an indicator of explicitation (Olohan and Baker 2000), and the distribution of unusual collocations as an indicator of normalization (Kenny 2000). The fact that these same features also distinguish individual translators can be interpreted as meaning that not all translators behave in the same way, in line with Toury’s view of “translation universals” as probabilistic tendencies rather than deterministic laws (Toury 2004). Thus, within an overall trend towards normalization, some translators may normalize more than others, staying below or above the average values for translated texts as regards, for instance, the relative frequency of unusual collocational patterns, compared to that in non-translated texts. Consistent patterns of choice regarding these features, which may be the conscious or unconscious product of specific translation strategies, can thus be seen as indicators of different individual stylistic traits. Incidentally, the results of two studies which set out to investigate translator style using target monolingual corpora seem to confirm a tendency towards one less studied hypothesized universal feature, that is “levelling out”: both Huang and Chu (2014) and Rybicki (2006) found that there is less internal variation in a corpus of translations than in that of their source texts, that is, translations seem to be more similar to one another than non-translated texts.

4. Conclusions

To conclude, corpus-based studies of literary translation are at the intersection of corpus linguistics and digital scholarship. They should combine quantity with quality, by which I mean not only that they may greatly benefit from annotation-intensive work, including manual annotation and alignment editing, but also that the analysis of corpus data should be complemented by the investigation of the wider cultural and historical context, including the circumstances of production and reception of both source and target texts. That is, while corpora and text analysis software may help to uncover textual and linguistic information, findings based on corpora should be assessed and interpreted in the light of extralinguistic knowledge; for as Rommel (2004: 92) so poignantly put it, “[n]o final result, let alone an ‘interpretation’ of a text” (including a translation) “can be obtained by computing power alone; human interpretation is indispensable to arrive at meaningful results”.


Baker, Mona (1993) “Corpus linguistics and translation studies: implications and applications”, in Mona Baker, Gill Francis and Elena Tognini-Bonelli (eds) Text and Technology: In honour of John Sinclair, Amsterdam and Philadelphia: John Benjamins, 233-250.

Baker, Mona (2000) “Towards a Methodology for Investigating the Style of a Literary Translator”, Target, 12(2): 241-266.

Bosseaux, Charlotte (2001) “A Study of the Translator’s Voice and Style in the French Translations of Virginia Woolf’s The Waves”, CTIS Occasional Papers, 1: 55–75.

Bosseaux, Charlotte (2004) “Point of View in Translation: A Corpus-based Study of French Translations of Virginia Woolf’s To the Lighthouse”, Across Languages and Cultures, 5(1): 107–122.

Bosseaux, Charlotte (2007) How Does it Feel? Point of View in Translation: The Case of Virginia Woolf into French, Amsterdam: Rodopi.

Burrows, John F. (2002a) “‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship”, Literary and Linguistic Computing, 17(3): 267–287.

Burrows, John F. (2002b) “The Englishing of Juvenal: Computational Stylistics and Translated Texts”, Style, 36(4): 677-699.

Burrows, John F. (2006) “All the Way Through: Testing for Authorship in Different Frequency Strata”, Literary and Linguistic Computing, 22(1): 27–47.

Cohen, Patricia (2010) “Digital Keys for Unlocking the Humanities’ Riches”, The New York Times, November 16, 2010 <> [accessed on 31/02/2015].

Frankenberg-Garcia, Ana and Diana Santos (2003) “Introducing COMPARA: the Portuguese-English Parallel Corpus”, in Federico Zanettin, Silvia Bernardini and Dominic Stewart (eds) Corpora in Translator Education, Manchester: St Jerome, 71-87.

Gellerstam, Martin (1986) “Translationese in Swedish novels translated from English”, in Lars Wollin and Hans Lindquist (eds) Translation Studies in Scandinavia, Lund: CWK Gleerup, 88-95.

Gellerstam, Martin (2005) “Fingerprints in Translation”, in Gunilla Anderman and Margaret Rogers (eds) In and Out of English: For Better, For Worse?, Clevedon: Multilingual Matters, 201-13.

Hewson, Lance (2011) An Approach to Translation Criticism. Emma and Madame Bovary in translation, Amsterdam and Philadelphia: John Benjamins.

Hoover, David L. (2008) “Quantitative Analysis and Literary Studies”, in Susan Schreibman and Ray Siemens (eds) A Companion to Digital Literary Studies, Oxford: Blackwell. <> [accessed on 31/02/2015].

Huang, Labo and Chiyu Chu (2014) “Translator’s style or translational style? A corpus-based study of style in translated Chinese novels”, Asia Pacific Translation and Intercultural Studies, 1(2): 122–141.

Hung, Jen-Jou, Marcus Bingenheimer and Simon Wiles (2010) “Quantitative evidence for a hypothesis regarding the attribution of early Buddhist translations”, Literary and Linguistic Computing, 25(1): 119-134.

Ji, Meng (2012) “Hypothesis testing in corpus-based literary translation studies”, in Michael Oakes and Meng Ji (eds) Quantitative Methods in Corpus-Based Translation Studies, Amsterdam and Philadelphia: John Benjamins, 53-72.

Ji, Meng and Michael P. Oakes (2012) “A Corpus study of early English translations of Cao Xueqin’s Hongloumeng”, in Michael Oakes and Meng Ji (eds) Quantitative Methods in Corpus-Based Translation Studies, Amsterdam and Philadelphia: John Benjamins, 175-208.

Johansson, Stig (2003) “Reflections on Corpora and their Uses in Cross-linguistic Research”, in Federico Zanettin, Silvia Bernardini and Dominic Stewart (eds) Corpora in Translator Education, Manchester: St Jerome, 135-48.

Kenny, Dorothy (2000) Lexis and Creativity in Translation: A Corpus-based Study. Manchester: St. Jerome Publishing.

Laviosa, Sara (1997) “How Comparable Can ‘Comparable Corpora’ Be?”, Target, 9(2): 289-319.

Laviosa, Sara (1998) “Core patterns of lexical use in a comparable Corpus of English narrative prose”, Meta, 43(4): 557-570.

Liu, Zequan, Liu Chaopeng, and Zhu Hong (2011) “Hong Lou Meng si ge yingyiben yizhe fengge chutan. [An exploration of the translators’ styles of four English versions of Hong Lou Meng.]”, Chinese Translators Journal, 32 (1): 60–64.

Maczewski, Jan-Mirko (1996) “Virginia Woolf's The waves in French and German waters: a computer assisted study in literary translation”, Literary and Linguistic Computing, 11(4): 175-186.

Mahlberg, Michaela (2007) “Corpora and translation studies: textual functions of lexis in Bleak House and in a translation of the novel into German”, in Vittoria Intonti, Graziella Todisco and Maristella Gatto (eds) La Traduzione. Lo Stato dell'Arte. Translation. The State of the Art, Ravenna: Longo, 115-135.

Mikhailov, Mikhail (2002) “Two Approaches to Automated Text Aligning of Parallel Texts in Fiction”, Across Languages and Cultures, 2(1): 87-96.

Munday, Jeremy (1998) “A Computer-assisted Approach to the Analysis of Translation Shifts”, Meta, 43(4): 542-556.

Olohan, Maeve and Mona Baker (2000) “Reporting that in Translated English: Evidence for Subconscious Processes of Explicitation?”, Across Languages and Cultures, 1(2): 141-158.

Øverås, Linn (1998) “In Search of the Third Code: An Investigation of Norms in Literary Translation”, Meta, 43(4): 571-88 <>.

Paloposki, Outi (2012) “Translation Criticism”, in Yves Gambier and Luc van Doorslaer (eds) Handbook of Translation Studies, Vol. 3, 184-190.

Price, Kenneth M. (2008) “Electronic Scholarly Editions”, in Susan Schreibman and Ray Siemens (eds) A Companion to Digital Literary Studies, Oxford: Blackwell. <> [accessed on 31/02/2015].

Rommel, Thomas (2004) “Literary Studies”, in Susan Schreibman, Ray Siemens and John Unsworth (eds) A Companion to Digital Humanities, Oxford: Blackwell Publishing, 88-96.

Rybicki, Jan (2006) “Burrowing into Translation: Character Idiolects in Henryk Sienkiewicz's Trilogy and its Two English Translations”, Literary and Linguistic Computing, 21(1): 91-103.

Rybicki, Jan (2012) “The great mystery of the (almost) invisible translator. Stylometry in translation”, in Michael Oakes and Meng Ji (eds) Quantitative Methods in Corpus-Based Translation Studies, Amsterdam and Philadelphia: John Benjamins, 231-248.

Rybicki, Jan and Magda Heydel (2013) “The stylistics and stylometry of collaborative translation: Woolf’s Night and Day in Polish”, Literary and Linguistic Computing, 28(4): 708-717.

Saldanha, Gabriela (2008) “Explicitation Revisited: Bringing the Reader into the Picture”, in Juliane House (ed) Beyond Intervention: Universals in Translation Processes, trans-kom 1(1): 20-35 <> [accessed on 31/02/2015].

Saldanha, Gabriela (2011a) “Style of Translation: The Use of Source Language Words in Translations by Margaret Jull Costa and Peter Bush”, in Alet Kruger, Kim Wallmach and Jeremy Munday (eds) Corpus Based Translation Studies: Research and Applications, London: Continuum, 237-258.

Saldanha, Gabriela (2011b) “Translator style: Methodological considerations”, The Translator, 17(1): 25–50.

Toury, Gideon (2004) “Probabilistic explanations in translation studies: Welcome as they are, would they qualify as universals?”, in Anna Mauranen and Pekka Kujamäki (eds) Translation Universals. Do they exist?, Amsterdam and Philadelphia: John Benjamins, 15-32.

van Dalen-Oskam, Karina (2013) “Names in novels: An experiment in computational stylistics”, Literary and Linguistic Computing, 28(2): 359-370.

Wang, Qing and Li Defeng (2012) “Looking for translator's fingerprints: a corpus-based study on Chinese translations of Ulysses”, Literary and Linguistic Computing, 27(2): 81-93.

Winters, Marion (2004) “German Translations of F. Scott Fitzgerald’s The Beautiful and Damned: A Corpus-based Study of Modal Particles as Features of Translators’ Style”, in Ian Kemble (ed) Using Corpora and Databases in Translation, Portsmouth: University of Portsmouth, 71–89.

Winters, Marion (2007) “F. Scott Fitzgerald’s Die Schönen und Verdammten: A Corpus-based Study of Speech-act Report Verbs as a Feature of translator’s Style”, Meta, 52 (3): 412–425.

Winters, Marion (2009) “Modal Particles Explained: How Modal Particles Creep into Translations and Reveal Translators’ Styles”, Target, 21 (1): 74–97.

Zanettin, Federico (2000) “Parallel corpora in translation studies. Issues in corpus design and analysis”, in Maeve Olohan (ed) Intercultural Fault Lines, Manchester: St Jerome, 105–118.

Zanettin, Federico (2001) IperGrimus. inTRAlinea Monographs, <> [accessed on 31/02/2015].

Zanettin, Federico (2012) Translation-Driven Corpora. Corpus Resources in Descriptive and Applied Translation Studies, London and New York: Routledge.

Zubillaga, Zuriñe, Naroa Uribarri and Ibon Sanz (2015) “Building a trilingual parallel corpus to analyse literary translations from German into Basque”, in Claudio Fantinuoli and Federico Zanettin (eds) New Directions in Corpus-based Translation Studies, Berlin: Language Science Press. 


[1] “Burrows begins by recording the frequencies of the most frequent words of a primary set of texts by the possible authors and calculating the mean frequency and standard deviation for each word in this set of texts. He then uses z-scores to compare the difference between the mean and each of the primary authors with the difference between the mean and the questioned text for each of the words. He completes the calculation by averaging the absolute values of the z-scores of all the words to produce Delta, a measure of the difference between the test text and each primary-set author. The primary set author with the smallest Delta is suggested as the author of the test text” (Hoover 2008: online).
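The procedure summarized by Hoover in note [1] can be sketched roughly as follows. This is an illustrative reading of that description, not Burrows’s own implementation: the function names, the default of 150 most frequent words, the pooling of raw counts to choose the word list, and the use of the sample standard deviation are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, stdev


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in a list of tokens."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocab]


def burrows_delta(primary, test_tokens, n_words=150):
    """Return {author: Delta} for a test text against a primary set.

    primary maps each candidate author to a list of tokens (assumed input format).
    """
    if len(primary) < 2:
        raise ValueError("at least two primary-set authors are needed")
    # Most frequent words of the primary set (raw counts pooled: an assumption).
    pooled = Counter()
    for tokens in primary.values():
        pooled.update(tokens)
    vocab = [word for word, _ in pooled.most_common(n_words)]

    profiles = {a: relative_freqs(t, vocab) for a, t in primary.items()}
    test_profile = relative_freqs(test_tokens, vocab)

    totals = {author: 0.0 for author in primary}
    used = 0
    for i in range(len(vocab)):
        column = [profile[i] for profile in profiles.values()]
        mu, sigma = mean(column), stdev(column)
        if sigma == 0:
            continue  # the word does not discriminate between authors
        used += 1
        z_test = (test_profile[i] - mu) / sigma
        for author, profile in profiles.items():
            z_author = (profile[i] - mu) / sigma
            totals[author] += abs(z_author - z_test)
    if used == 0:
        raise ValueError("no word varies across the primary set")
    # Delta is the mean absolute difference between z-scores.
    return {author: total / used for author, total in totals.items()}


# Toy usage: the author with the smallest Delta is the suggested candidate.
primary_set = {"A": "the the the cat sat".split(), "B": "a a a dog ran".split()}
scores = burrows_delta(primary_set, "the the cat ran".split())
print(min(scores, key=scores.get))
```

In practice the primary set would contain long texts and many candidate authors; the toy data here only shows the shape of the calculation.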

[2] The results combined those from their own study, those from Baker’s (2000) study, and those from yet another study, by Liu et al. (2011), who compared different English translations of the Chinese classic novel Hongloumeng by Cao Xueqin.

[3] Saldanha (2011b: 35) explains that instances of italics “were classified according to function (italics used for emphasis, italics highlighting foreign words, titles of books, words mentioned rather than used, etc.) and [according] to whether they were carried across from the source to the target text or otherwise omitted or added in the target text”.

About the author(s)

Federico Zanettin (PhD in Translation Science, University of Bologna) is Associate Professor in English Language and Translation at the University of Perugia.

