Strange bedfellows

Shifting paradigms in the corpus-based analyses of literary translations

By Gerard Lynch (Independent Scholar, Ireland)

Abstract & Keywords

Although the application of computational and statistical approaches in the analyses of literary translations has gained considerable traction of late, there is still a good deal of progress to be made on bridging the divide between the cross-disciplinary nature of such analyses and the traditions of translation studies and the computing sciences. This article attempts to document issues and methodological divergences which arise in such work, collaborative or otherwise, drawing upon studies across the digital humanities, stylometry, corpus linguistics and translation studies, and identifying areas in which further progress could be achieved based on a mutual collaborative spirit between disciplines.

Keywords: machine learning, stylometry, corpus-based studies

©inTRAlinea & Gerard Lynch (2017).
"Strange bedfellows: Shifting paradigms in the corpus-based analyses of literary translations"
inTRAlinea Special Issue: Corpora and Literary Translation
Edited by: Titika Dimitroulia and Dionysis Goutsos
This article can be freely reproduced under Creative Commons License.
Stable URL:

1. Introduction

The application of computational and/or statistical analyses to the written text is a practice that dates back as far as the 19th century. In 1887 Mendenhall’s pioneering word-length distribution analysis of British authors Dickens and Thackeray was published in Science, followed by an expanded analysis in which he compared Shakespeare with his contemporaries (Mendenhall 1901). This experiment was inspired by earlier personal correspondence from the British logician Augustus De Morgan, who touched on the idea of using word length measures to determine the author of the Pauline epistles.[1]

From these early steps towards solving a number of age-old ‘who wrote it?’ questions grew a field which now spans disciplinary boundaries. Zipf (1935) defined the eponymous law, one of the earliest forays into corpus linguistics, stating that the frequency of any word in a natural language corpus is inversely proportional to its rank; a phenomenon that had also been noted in relation to population growth in cities (Auerbach 1913).
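As a rough illustration of the rank-frequency relationship described above, the following Python sketch counts the words in a text and lists them by rank. The function name rank_frequency, the simple regular-expression tokeniser and the toy sentence are assumptions made for illustration and are not drawn from Zipf's work. On a sufficiently large corpus one would expect the product of rank and frequency to stay roughly constant, something a toy sentence cannot demonstrate.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency) triples, most frequent word first."""
    # Naive tokeniser (an assumption): lower-case runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending frequency, breaking ties alphabetically for stability.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


# Toy text, far too small to exhibit the law itself.
sample = "the cat and the dog and the bird saw the cat"
for rank, word, freq in rank_frequency(sample)[:3]:
    # Under Zipf's law, rank * freq would be approximately constant.
    print(rank, word, freq, rank * freq)
```

In practice the same function could be run over a full novel or translation, and the rank and frequency columns plotted on logarithmic axes to inspect how closely the text follows the expected straight line.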

British statistician George Udny Yule examined sentence length distributions in the works of Coleridge and Bacon (Yule 1944) and carried out an attribution exercise on the works of Thomas à Kempis, spearheading a second wave of authorship attribution exercises using statistical methods. After the Second World War, a number of scholars ushered in a new age of computational language analysis and one should not underestimate the value of Father Roberto Busa’s work. Considered the first digital humanist, his collaboration with Thomas Watson from IBM on a digital concordance of texts by Thomas Aquinas was ground-breaking.

The next milestone in language and computation came through the work of Warren Weaver on a system facilitating automated translation of Russian into English, a Cold War-era US intelligence project that would create the field of machine translation. In its turn, this would generate much of the research methodology and fund the resources, firmly establishing computational linguistics and natural language processing as a discipline in the twentieth century.[2]

Complementing this pioneering work came a study which diverted the attention back to authorship attribution, namely the work of Mosteller and Wallace (1963) on the authorship attribution of the Federalist papers. This was the first such work to use a computer to analyse texts, previous works having relied on statistical methods which were computed by hand.

In today’s era of Google Translate, culturomics and big data,[3] computational analyses of literary translation are still a relatively niche endeavour. Although the application of corpus linguistic methods has become commonplace in translation studies since the 1990s,[4] there appears to be a smaller body of work concerned with applying state-of-the-art techniques in natural language processing to answering questions about literary translations in particular. This article outlines a number of factors which may be hampering research collaboration in this area and makes a number of suggestions for possible future collaborative studies.

2. Trends in translation stylometry

This section attempts to contrast the methodological treatment of corpus studies in literary stylometry from the perspective of translation studies and corpus linguistics, the digital humanities and literary stylometry, and the new emergent paradigm which employs text classification and machine learning approaches for research purposes.

2.1 First phase: Of translation universals and early corpus studies

Since Gellerstam (1986) coined the term translationese as a descriptor for what Frawley (1984) referred to as the third code,[5] there has been interest in the application of corpus linguistic methods to questions of translation stylometry, in particular quantifying distinctions between translationese and non-translated language. Baker (1993, 2000) defined a framework for investigating the stylistic properties of parallel translations, giving examples from the literary domain. While not excessively sophisticated, the computational methodology used, based on the analysis of word frequencies, was not common at the time in the translation studies community. Baker posed three main questions when investigating stylistic variation in literary translation: 1) whether a translator’s preference for particular linguistic constructions is independent of the style of the original author, 2) whether the stylistic choices are independent of the general preferences of the target language, and if so, 3) whether they could be explained with reference to the social, cultural or ideological position of the translator.

Adopting a similar methodology, Mikhailov and Villikka (2001) focused on Finnish translations from Russian using statistical measures from corpus linguistics, reporting consistencies in a translator’s style across separate translations. They conclude that modals and sentence length define a translator’s style, but do not consider the translator’s social background or other reasons for the stylistic variation. Olohan (2001) examines the explicitation translation universal through optional items in translated English, comparing a corpus of translations with the British National Corpus using statistical tests. Other substantial corpus-based studies include Kenny (2001) on lexical choice in translation and Laviosa-Braithwaite (1997) on simplification (see Zanettin (2013) for an excellent overview).

More recently, Saldanha (2011) expands upon Baker’s blueprint for corpus linguistic research in translation studies, investigating a corpus of translations from Portuguese and Spanish into English, with a focus on stylistic features such as italicized text, preservation of cultural expressions from the source language and use of the connective that. She also compares the individual frequencies of each feature per translator with a bilingual reference corpus in order to situate the feature variation in a larger linguistic context. Winters’ (2007) work on German translations of F. Scott Fitzgerald’s The Beautiful and Damned identified speech-act reporting verbs as markers of a translator’s style. Li, Zhang, and Liu (2011) investigated translator style from a more sociological perspective, contrasting the translation of the Chinese epic Hongloumeng by a British sinologist with that of an official Chinese government-sanctioned translator, focusing on type-token ratio and related corpus measures to profile stylistic divergence and thereby translation ideology.

As evidenced by the literature, these studies tend to be fine-grained accounts of the translation of one particular work or of a pair of translators, paying close attention to any cultural or ideological bias or effects in the translation which can be attributed to the translator. This contrasts somewhat with the studies in literary stylometry in the next section, which can indeed be fine-grained but tend to focus not on cultural and ideological markers in translation but rather on issues of stylistic development over time, influence and identifying divergent patterns in joint translations.

2.2 Second Phase: Burrowing into Translation and Corpora++

Alongside the research in the translation studies literature, scholars in the areas of digital humanities and literary stylometry have been applying corpus linguistic methods to questions of stylometry for quite some time. Although translations were not the main focus of the discipline, there are a number of scholars who focused on tasks in this space. Commonalities in these studies include the usage of non-standard statistical metrics such as John Burrows’s Delta method[6] and a focus on the stylometric distinction of character idiolects in an author’s canon, which is often extended to translation. Another aspect of studies in this field which refer to translation is the focus on pre-20th century texts, presumably for reasons of data accessibility. Following on from his pioneering study on the analysis of character idiolects in the works of Jane Austen (Burrows 1987), in later work Burrows (2002b) inadvertently established a breakaway subfield of literary translation stylometry by applying his Delta metric to twenty translations from Latin to English of the Roman poet Juvenal, identifying a seventeenth-century translation by Thomas D’Urfey as the most similar to all of the other translations. Burrows concluded that this translation may have been used as a reference translation for later works. The application of stylometry to non-prose text is generally less common, although Pantopoulos (2009) investigated the stylistic characteristics of four English-language translators of modern Greek poetry using corpus linguistic methods, and recent work by Herbelot (2014) investigated a distributional semantics approach to modelling poetry.
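One common formulation of the Delta measure takes the most frequent words across a set of texts, converts each word's relative frequency in each text to a z-score, and reports the mean absolute difference between the z-scores of two texts. The Python sketch below follows that formulation under stated assumptions and is not Burrows's own implementation: the function names, the tokeniser, the use of the population standard deviation and the toy texts are all illustrative choices.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    # Naive tokeniser (an assumption): lower-case runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def burrows_delta(texts, n_words=3):
    """texts maps a label to a text; returns {(label_a, label_b): delta}."""
    tokens = {name: tokenize(text) for name, text in texts.items()}
    # Most frequent words over the whole collection (ties broken alphabetically).
    pooled = Counter()
    for words in tokens.values():
        pooled.update(words)
    ranked = sorted(pooled.items(), key=lambda item: (-item[1], item[0]))
    vocab = [word for word, _ in ranked[:n_words]]

    # Relative frequency of each vocabulary word in each text.
    rel = {}
    for name, words in tokens.items():
        counts = Counter(words)
        total = len(words) or 1  # guard against empty texts
        rel[name] = [counts[word] / total for word in vocab]

    # z-score every word across the texts (population standard deviation assumed).
    z = {name: [] for name in texts}
    for i in range(len(vocab)):
        column = [rel[name][i] for name in texts]
        mu, sigma = mean(column), pstdev(column)
        for name in texts:
            z[name].append((rel[name][i] - mu) / sigma if sigma else 0.0)

    # Delta is the mean absolute difference of z-scores for each pair of texts.
    names = list(texts)
    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy data: two identical "translations" and one differing text.
toy = {
    "t1": "the cat and the dog",
    "t2": "the cat and the dog",
    "t3": "a bird and a fish and a frog",
}
print(burrows_delta(toy))
```

Lower values indicate greater stylistic similarity. Studies of the kind surveyed here typically use far longer texts and a much larger list of frequent words than this toy setting.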

Rybicki (2006) took up the mantle of literary translation stylometry and applied Burrows’s Delta metric to character idiolects in translations of the Polish epics of Sienkiewicz. He found similarly distinct clustering of character idiolects[7] in translations and in original texts. Lynch and Vogel (2009) applied Rybicki’s methodology coupled with the χ² metric to translations of Henrik Ibsen in English and German and their source texts, and observed similar patterns of distinctive characterization in source texts and translations alike. Rybicki (2012) continued his work on translation stylometry with Burrows’s Delta, finding that texts tended to cluster by author rather than translator in a corpus of English to Polish literary translations using frequent words as features, a result which he claims verifies Venuti (1995) and his theory of a translator’s invisibility. Later work by Rybicki and Heydel (2013) investigated a collaborative translation of Woolf’s Night and Day into Polish, where one translator completed the work of the other after her death, successfully identifying the point where the new translator took up the task. Related work on Slavic languages was carried out by Grabowski (2011, 2013), who investigated translation patterning in English, Russian and Polish versions of Nabokov’s Lolita, focusing on sentence length and word type/token distributions in source and target texts.

The majority of stylometric studies on literary translation use relatively shallow textual features. However, a handful of studies go beyond word n-grams and Burrows’s Delta in their scope. Lucic and Blake (2011) use parses from the Stanford Lexical Parser (Chen and Manning 2014) of two English translations of Rainer Maria Rilke and observe differing patterns of negation and adverbial modifiers between the two translations. Novel approaches are employed by El-Fiqi, Petraki, and Abbass (2011), who use a network-theoretic linked representation of a text to identify patterns in two translations of the Holy Qur’an into English; Hung et al. (2010), who use variable-length n-grams to attribute Chinese translations of 4th century Buddhist texts written in Sanskrit; and Popescu (2011), who investigates translations in a corpus of literary text from Project Gutenberg using string kernels (i.e. sequences of characters) as distinguishing features, but advocates caution in this approach from an interpretability point of view.

The literary stylometric studies which focus on translation have some elements in common with the studies from the translation studies literature, such as the focus on individual translators and authors, although the focal points of the research projects can often differ, with more of an emphasis on attribution, stylistic profiling and language evolution. The next phase of studies shares properties including macro-level analysis and the discovery of over-arching trends and universals when applied to translated text.

2.3 Third Phase: The Machine (Learning) Age

The adoption of machine-learning approaches to all possible hypotheses in the current era of big data does not exclude questions of corpora and translation and this phase follows the pioneering work of Baroni and Bernardini (2006) on a comparable corpus of Italian journalistic text. These methods can identify stylistic patterns in textual corpora with considerable ease and may be useful to identify hitherto unconsidered textual variation. These studies often employ a wide range of textual feature representations, including part-of-speech tokens, corpus statistics such as type-token ratio and lexical richness measures and word n-grams. Commonalities in studies employing such methodology include investigation of the stylistic patterns of individual translators and macro-analyses, often eschewing literary genres for journalistic text and focusing on translations as target language corpora only.
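The kinds of feature representation mentioned above can be derived from raw text with very little code. The Python sketch below computes a type-token ratio, an average sentence length in words and the most frequent word n-grams for a single document. The function name stylometric_features, the naive sentence splitter and tokeniser, and the example sentence are assumptions for illustration only; part-of-speech features are omitted because they would require an external tagger.

```python
import re
from collections import Counter


def stylometric_features(text, n=2):
    """Return a small dictionary of document-level features for one text."""
    # Naive sentence splitter (an assumption): break on ., ! and ? only.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive tokeniser (an assumption): lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Word n-grams built by zipping n shifted copies of the token list.
    ngrams = Counter(zip(*(tokens[i:] for i in range(n))))
    return {
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "avg_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
        "top_ngrams": ngrams.most_common(3),
    }


features = stylometric_features("The cat sat. The cat ran away. A dog barked!")
for name, value in features.items():
    print(name, value)
```

A vector of such values per text segment, together with a label such as the translator or the source language, is the typical input to the classifiers discussed in this phase. Note that the type-token ratio is sensitive to text length, so segments of comparable size are usually compared.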

Questions regarding gender effects in translation stylometry have been raised in Shlesinger et al. (2010), following on earlier work by Koppel et al. (2002) which investigated gender differences in literary language using machine learning methodology, and Leonardi (2007), who investigated differences in translator ideology and language by gender. In the experiments of Shlesinger et al. on translated text, the classifiers failed to distinguish between male and female translators using frequent word features, with classification results barely above the chance baseline of 50%, although earlier corpus statistics had identified statistically significant features in the overall subcorpora. The corpus used in these experiments consisted of 213 literary extracts translated from more than 30 source languages.[8] From this study, the authors surmise that the application of supervised learning approaches to questions of textual stylometry provides a more rigorous methodology for establishing the efficacy of stylistic features. Conventional statistics identified a set of features which occurred statistically more often in translations from translators of each gender. However, these features ultimately did not prove effective at discriminating the gender of the translator of a text using machine learning methods. The subject of language, gender and translation is one which deserves further investigation, perhaps using a corpus restricted to a single source language in order to reduce variability and improve the efficacy of any machine learning experiments by limiting the number of possible factors of influence on textual style.

Lynch and Vogel (2012) applied support vector machine classifiers to a corpus of literary translations in order to detect the source language of a textual segment, an approach which had been considered previously for non-literary text such as Europarl and journalistic articles from The New York Times and Haaretz, among others (van Halteren 2008, Koppel and Ordan 2011). This approach succeeded in distinguishing the source language of a text and identified a number of corpus features and word and POS n-grams[9] which were distinctive between translations from different source languages. However, the study did not rigorously examine whether features were truly source language markers or somehow artefacts of the texts chosen. Recently, this work was replicated (Klaussner et al. 2014) using a completely disjoint set of contemporaneous texts. Comparable results were reported, along with additional distinctive features such as trigrams of part-of-speech tags which had not been employed in the previous study, which allays any fears about the original results being dependent on the particular corpus used.

More recent research using advanced machine learning methods has challenged Venuti’s theory of the translator’s invisibility. Forsyth and Lam (2013) investigated parallel translations into English of the correspondence between Vincent Van Gogh and his brother Theo and identified separate stylistic traces of original authorship and translator’s style using both machine learning and corpus linguistic approaches, finding that the stylistic divergence between translations (translation distinctiveness) was less strong than the stylistic divergence between originals (authorial discriminability). This correlates with Vajn (2009), who developed a theory of two-dimensional style as applied to parallel English translations of Plato’s Republic, incorporating authorial and translatorial stylistic features into the descriptive process. Bogdanova and Lazaridou (2014) developed a related study investigating the preservation of authorial style across translation from an authorship attribution perspective and attempted to cluster works by their original author both in the original language and in translation, one approach being to translate the translations back to the original language using machine translation and to apply stylistic clustering techniques to the machine translation and the original text. Lynch (2014) focuses on the translations of Ivan Turgenev, Anton Chekhov and Fyodor Dostoyevsky by the Victorian-era British translator Constance Garnett and uses support vector machine classifiers to guess the original author of a translation segment by Garnett, obtaining high classification accuracy between the source authors using document statistics such as average sentence length and lexical richness on the textual segments, and even better accuracy employing linguistic features such as reporting verbs and part-of-speech clusters. This work was motivated by an article by Remnick (2005) on the translation of Russian literature in which Vladimir Nabokov and other contemporaries were quoted as saying that the translations of Garnett had no distinct voice, her translations of Turgenev being indistinguishable from her renderings of Dostoyevsky.

3. Methodological and motivational considerations

3.1 Supervised learning and corpus tools

The state of the art in text analytics research and analysis tends towards supervised learning approaches such as support vector machines (Joachims 1998), Naive Bayes classifiers and neural networks. These are more complex than the statistical metrics previously employed in corpus linguistics. However, they can be very useful for discovering unknown patterns in data when compared with traditional statistical measures such as t-tests and chi-square tests, which require prior domain knowledge about which features to measure. Although there are both open-source and proprietary software packages available to carry out these analyses, such as R, WEKA (Frank et al. 2005), and RapidMiner, these tend to come with a learning curve which may be too steep for curious humanists and corpus linguists who have firmly adopted software packages such as the IMS Open Corpus Workbench (Christ 1994) and the WordSmith tools package (Scott 1996) for use in their descriptive analyses.
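To make the contrast concrete, a traditional test requires the analyst to nominate a feature in advance, for example the count of a single word in two subcorpora. The Python sketch below computes the Pearson chi-square statistic for such a 2×2 comparison. The function name and the counts are hypothetical, and no continuity correction is applied.

```python
def chi_square_2x2(count_a, total_a, count_b, total_b):
    """Pearson chi-square statistic for one word in corpus A versus corpus B.

    count_* is the number of occurrences of the chosen word and total_* is
    the number of tokens in each corpus (all values are hypothetical inputs).
    """
    observed = [
        [count_a, total_a - count_a],  # corpus A: the word, all other tokens
        [count_b, total_b - count_b],  # corpus B: the word, all other tokens
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected == 0:
                continue  # e.g. the word is absent from both corpora
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


# Hypothetical counts: the word occurs 30 times in 100 tokens of corpus A
# and 10 times in 100 tokens of corpus B.
print(chi_square_2x2(30, 100, 10, 100))
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to significance at the 0.05 level. The test says nothing about features the analyst did not think to count, which is where the supervised approaches described above can help.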

There is a growing need for more straightforward machine learning platforms which can be used by a greater subsection of the population at large, and Google’s Prediction API is a step in this direction, facilitating the application of machine learning for non-experts, although it does not focus specifically on text processing. The emergence of graphical tools which allow for the creation of text-processing pipelines through a drag-and-drop interface is also a development of note in this space which can aid research in advanced textual stylometry in general. Perhaps a hybrid system combining a collocation viewer with a graphically controlled machine learning module and language analytics functionality via the Stanford NLP Toolkit (Manning et al. 2014) would be the ultimate general-purpose tool of choice for future studies in translation stylometry, although open-source solutions such as AntConc (Anthony et al. 2013) are approaching this level of functionality.

3.2 Corpora and data management

The field of statistical machine translation applies statistical methods to large parallel translation corpora such as Europarl and the Canadian Hansard corpus in order to derive models to translate from one language to another. Machine translation has been a mainstay of natural language processing research since Weaver’s pioneering paper in the 1940s and attracts a considerable amount of research funding year-on-year. As a result, a good deal of computational stylistic analysis of translations focuses on corpora such as these, with scant regard for literary translations.

Thus, large scale parallel corpora of literary translations are not as prevalent, with the translation studies literature focusing on small-scale analyses of one work or at most, the complete works of a particular author. As a result, studies which focus on a fine-grained analysis of the work of one author using computational methods can often be subject to critique in the computational linguistics literature, as the tendency is to use as large a corpus as possible to obtain statistically significant and generalisable experimental results.

Another aspect hampering fully parallel analyses of literary translations is the limited availability of parallel reference translations in online sources such as Project Gutenberg and WikiSource, which try to improve matters by imposing a wiki-format browsing structure on copyright-expired source texts. Given the effort that can be involved in digitising a print work, it is perhaps unsurprising that, copyright restrictions notwithstanding, it is unusual to find two contemporary translations of the same work on a site such as Project Gutenberg, unless that work is indeed of high literary regard. Due to this accessibility bottleneck, curious computational humanists can quickly become discouraged when searching for corpora for their investigations.

Cheesman et al. (2010) propose the notion of translation arrays, encompassing linked databases of multiple translations of the same source text in various languages, arguing that such data would provide a rich resource for cross-cultural studies in language evolution, in the monolingual case,[10] and general cultural questions also.

Their own related investigations have focused on the visualisation of textual re-use in translations of common texts such as the Bible, the same text examined by Covington et al. (2014), who cluster Biblical translations using dendrogram clustering and document-level metrics such as sentence length, average type-token ratio and “idea density” (i.e. the ratio of factual propositions to words). This approach can be extended to any texts for which multiple parallel translations are available.

3.3 Asking questions

Based on the literature surveyed here, the scholar’s own background can tend to influence how they approach a study in translation stylometry, based on the individual norms of their host discipline.

Applying best practices for computational studies can bias the approach of a research project in literary translation stylometry. Such studies tend to adopt a macro-level view of translation stylometry, such as defining characteristics of translationese in a large corpus of reportage (Baroni and Bernardini 2006, Koppel and Ordan 2011), quantifying the existence of translation universals in technical and medical translations (Ilisei and Inkpen 2011) and separating a corpus of technical translations by translation direction (Kurokawa et al. 2009).

Research stemming from literary studies and translation studies tends to take a more fine-grained approach, usually attempting to ground theories about the cultural background or processes of the translator through an examination of their translation result, often compared with the source, following work by such scholars as Baker and Saldanha. Thus, the application of machine learning and advanced stylometric analysis can often be hampered by the relatively small size of the corpus under examination or the particularly fine-grained nature of the research question.

On the other hand, those in the translation studies field often have first-hand, unfettered access to contemporaneous copyrighted translations which are not available to the general public, such as the Translational English Corpus (Olohan 2002), which is a considerable advantage in carrying out corpus studies. Thus, cross-discipline collaborations often produce the most interesting experimental designs and studies. Examples include projects by computational linguist Marco Baroni and translation studies scholar Silvia Bernardini, and collaborations between translator and English professor Jan Rybicki and corpus linguistics scholar Maciej Eder, which can draw on strengths from the individual fields and also respect the validity of the investigatory norms of each discipline.

4. Towards a shared future

With this spirit in mind, there are many areas in which both researchers of a quantitative nature and translation studies scholars may collaborate which maximise the skills and talents of each discipline. Within the computational linguistics community, there is a growing interest in the analysis of literary text, including translation, with the establishment of the Workshop on Computational Linguistics for Literature, currently in its third iteration, and issues of collaboration with the humanities have been documented by Hammond et al. (2013). With the relentless drive towards data-driven methods in the digital humanities, it is important not to lose sight of the original tenets of translation studies mainstays, such as Baker, who espouse that corpus-based studies should seek to ground their analyses in the cultural background, bias or preferences of a translator or movement. At the same time, progress in automatic text analytics has resulted in a set of tools and processes which enable detailed, semi-automated linguistic and stylistic analyses of large textual corpora in near real-time.

Neuman et al. (2013) describe computational methods for analysing the quantity of metaphor in text. Such approaches may also be applied to literary translations in order to quantify the level of metaphor within them, and quantitative approaches may be applied in concert with theories of metaphor in translation such as the work by Steiner (2002). Also, any application of Latent Dirichlet Allocation or topic modelling approaches (Blei et al. 2003) to translated text may serve to illustrate over-arching trends of interest within parallel translations and may prove a fruitful area of interest for the future.

Other topics which have not received substantial attention to date include the question of how a translator’s style relates to their own authorial style. There is of course a rather limited pool of translators who are also published authors in their own right. However, Wang and Li (2012) focus on this issue in their study of two Chinese translations of Joyce’s Ulysses. They distinguish translation effects stemming from the source language, such as the post-positioning of adverbial phrases in the target text, from individual lexical choices by a translator, manifested in the choice of a dialectally sensitive translation of the English verb to know and in one translator’s systematic overuse of the Chinese verb duo (to stroll or to saunter) in both his translation and his original writing.

Furthermore, the application of syntactic and semantic parsing to literary translations is another area which has received little attention, save the previously mentioned work by Lucic and Blake (2011). One possible reason for this gap may be misconceptions about the applicability of software trained on modern textual corpora such as the Wall Street Journal to literary texts,[11] although some basic investigation confirms that resources are available for textual parsing in Latin (Bamman and Crane 2006), medieval Portuguese (Rocio et al. 2003) and medieval French (Mazziotta 2010, Stein and Prevost 2013). In a similar vein, van Dalen-Oskam (2012) uses open-source general purpose named-entity recognition tools in her study of the translation of proper names in Dutch novels and their English translations, finding the accuracy of the NER software to be more than sufficient for the needs of the study, although she mentions that some customisation of the software would be a welcome addition in order to recognise subtypes of proper names.

Many collaborative endeavours between humanists and computational scientists are built around the idea of creating shared resources, as evidenced by the Text Encoding Initiative and related projects (Hockey and Walker 1993). There is a rise in the number of data-driven humanities projects, and a number of key scholars in the area span both fields quite comfortably, including Jockers (2013), who has investigated large-scale topic models on thematic trends in literature, and the team behind the Google Ngrams project (Michel et al. 2011). A possible study in this space could apply big data analysis methodology to a large corpus of parallel translation with a constant L1 and L2 to investigate trends in stylistic variation that may transcend individual translators’ choices.

On a final note, it may be of interest to investigate the latest machine learning approach du jour, so-called deep learning (LeCun et al. 2015), which leverages artificial neural networks in multiple layers to learn human-like representations of categories in data. However, these approaches typically expect gargantuan labelled datasets for training and thus may be excessive for current purposes.

References

Altintas, K., Can, F., and Patton, J. (2007). Language change quantification using time-separated parallel translations. Literary and Linguistic Computing, 22(4), 375–393.

Anthony, L., Crosthwaite, P., Kim, T., Marchand, T., Yoon, S., Cho, S.-Y., Oh, E., Ryu, N.-Y., Hong, S.-H., Lee, H.-K., et al. (2013). A critical look at software tools in corpus linguistics. Linguistic Research, 30(2), 141–161.

Auerbach, F. (1913). Das Gesetz der Bevölkerungskonzentration. Petermanns Geographische Mitteilungen, Vol. LIX.

Baker, M. (1993) Corpus Linguistics and Translation Studies. Implications and Applications, in Baker, M., J. Francis and E. Tognini Bonelli, Text and technology: in honour of John Sinclair, Amsterdam and Philadelphia: John Benjamins, 233-250.

Baker, M. (2000) ‘Towards a methodology for investigating the style of a literary translator’, Target, 12(2), 241-266.

Bamman, D., and Crane, G. (2006). The design and use of a Latin dependency treebank. In Proceedings of the Fifth Workshop on Treebanks and Linguistic Theories (TLT2006), pp. 67–78.

Baroni, M., and Bernardini, S. (2006). A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3), 259.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Bogdanova, D., and Lazaridou, A. (2014). Cross-Language Authorship Attribution. In Language Resources and Evaluation, Reykjavik, Iceland.

Burrows, J. (2002a). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267.

Burrows, J. (2002b). The Englishing of Juvenal: computational stylistics and translated texts. Style, 36(4), 677-699.

Burrows, J. F. (1987). Computation into criticism: A study of Jane Austen’s novels and an experiment in method. Clarendon Press, Oxford.

Candel-Mora, M. A., and Vargas-Sierra, C. (2013). An analysis of research production in corpus linguistics applied to translation. Procedia-Social and Behavioral Sciences, 95, 317–324.

Cheesman, T., Thiel, S., Flanagan, K., Geng, Z., Ehrmann, A., Laramee, R. S., Hope, J., and Berry, D. M. (2010). Translation Arrays: Exploring Cultural Heritage Texts Across Languages. In Proceedings of Digital Humanities 2012, Kings College London.

Chen, D., and Manning, C. D. (2014). A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of EMNLP 2014.

Christ, O. (1994). The IMS corpus workbench technical manual. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.

Covington, M. A., Potter, I., and Snodgrass, T. (2014). Stylometric classification of different translations of the same text into the same language. Literary and Linguistic Computing. Advance Access.

El-Fiqi, H., Petraki, E., and Abbass, H. (2011). A computational linguistic approach for the identification of translator stylometry using Arabic-English text. In Fuzzy Systems (FUZZ), 2011 IEEE International Conference on, pp. 2039-2045. IEEE.

Forsyth, R. S., and Lam, P. W. Y. (2013). Found in translation: To what extent is authorial discriminability preserved by translators? Literary and Linguistic Computing, 29(2), 199–217.

Frank, E., Hall, M., Holmes, G., Kirkby, R., Pfahringer, B., and Witten, I. (2005). Weka: A machine learning workbench for data mining. Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, 1305–1314.

Frawley, W. (1984). Prolegomenon to a theory of translation. Translation: Literary, linguistic and philosophical perspectives, 159, 175.

Gellerstam, M. (1986). Translationese in Swedish novels translated from English. Translation studies in Scandinavia, 88–95.

Grabowski, Ł. (2011). Crossing the Frontiers of Linguistic Typology: Lexical Differences and Translation Patterns in English and Russian Lolita by Vladimir Nabokov. In New Perspectives in Language, Discourse and Translation Studies, pp. 227–240. Springer.

Grabowski, Ł. (2013). Quantifying English and Polish Lolitas: A Corpus-Driven Stylistic Comparison. In Correspondences and Contrasts in Foreign Language Pedagogy and Translation Studies, pp. 181–195. Springer.

Hammond, A., Brooke, J., and Hirst, G. (2013). A tale of two cultures: Bringing literary analysis and computational linguistics together. In Proceedings of the 2nd Workshop on Computational Linguistics for Literature (CLFL13), Atlanta.

Herbelot, A. (2014). The semantics of poetry: A distributional reading. Literary and Linguistic Computing.

Hockey, S. (2004). The history of humanities computing. A companion to digital humanities, 3–19.

Hockey, S., and Walker, D. (1993). Developing Effective Resources for Research on Texts: Collecting Texts, Tagging Texts, Cataloguing Texts, Using Texts, and Putting Texts in Context. Literary and Linguistic Computing, 8(4), 235–242.

Hung, J.-J., Bingenheimer, M., and Wiles, S. (2010). Quantitative evidence for a hypothesis regarding the attribution of early Buddhist translations. Literary and Linguistic Computing, 25(1), 119–134.

Ilisei, I., and Inkpen, D. (2011). Translationese Traits in Romanian Newspapers: A Machine Learning Approach. International Journal of Computational Linguistics and Applications.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning: ECML-98, 137–142.

Jockers, M. L. (2013). Macroanalysis: Digital methods and literary history. University of Illinois Press.

Kenny, D. (2001). Lexis and creativity in translation: a corpus-based study. St Jerome.

Klaussner, C., Lynch, G., and Vogel, C. (2014). Following the trail of source languages in literary translations. In AI-2014: Thirty-fourth SGAI International Conference on Artificial Intelligence, pp. 1–18. Springer.

Koppel, M., and Ordan, N. (2011). Translationese and its dialects. 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA.

Koppel, M., Argamon, S., and Shimoni, A. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4), 401–412.

Kurokawa, D., Goutte, C., and Isabelle, P. (2009). Automatic Detection of Translated Text and its Impact on Machine Translation. In Proceedings of the XII MT Summit, Ottawa, Ontario, Canada. AMTA.

Laviosa-Braithwaite, S. (1997). Investigating simplification in an English comparable corpus of newspaper articles. Klaudy and Kohn, 1997, 531–540.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.

Leonardi, V. (2007). Gender and Ideology in Translation: Do Women and Men Translate Differently?: a Contrastive Analysis from Italian Into English, Vol. 301. Peter Lang.

Li, D., Zhang, C., and Liu, K. (2011). Translation Style and Ideology: a Corpus-assisted Analysis of two English Translations of Hongloumeng. Literary and Linguistic Computing, 26(2), 153.

Lucic, A., and Blake, C. (2011). Comparing the Similarities and Differences between Two Translations. In Digital Humanities 2011, p. 174. ALLC.

Lynch, G. (2014). A Supervised Learning Approach Towards Profiling the Preservation of Authorial Style in Literary Translations. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), Dublin, Ireland. Association for Computational Linguistics.

Lynch, G., and Vogel, C. (2009). Chasing the Ghosts of Ibsen: A Computational Stylistic Analysis Of Drama in Translation. In Digital Humanities 2009: University of Maryland, College Park, MD, USA, p. 192. ALLC/ACH.

Lynch, G., and Vogel, C. (2012). Towards the Automatic Detection of the Source Language of a Literary Translation. In Kay, M., and Boitet, C. (Eds.), Proceedings of the 24th International Conference in Computational Linguistics (Coling), Mumbai, India, pp. 775–784. Indian Institute of Technology Bombay.

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. (2014). The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60 Baltimore, Maryland. Association for Computational Linguistics.

Mazziotta, N. (2010). Building the syntactic reference corpus of medieval French using notabene rdf annotation tool. In Proceedings of the Fourth Linguistic Annotation Workshop, pp. 142–146. Association for Computational Linguistics.

Mendenhall, T. C. (1887). The characteristic curves of composition. Science, pp. 237–246.

Mendenhall, T. C. (1901). A mechanical solution to a literary problem. Popular Science Monthly, 60, 97–105.

Michel, J., Shen, Y., Aiden, A., Veres, A., Gray, M., Pickett, J., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., et al. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014), 176.

Mikhailov, M., and Villikka, M. (2001). Is there such a thing as a translator’s style? In Proceedings of Corpus Linguistics 2001, Lancaster, UK, pp. 378–385.

Mosteller, F., and Wallace, D. L. (1963). Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers. Journal of the American Statistical Association, 58(302), 275–309.

Neuman, Y., Assaf, D., Cohen, Y., Last, M., Argamon, S., Howard, N., and Frieder, O. (2013). Metaphor identification in large texts corpora. PloS One, 8(4), e62343.

Olohan, M. (2001). Spelling out the optionals in translation: a corpus study. UCREL technical papers, 13, 423-432.

Olohan, M. (2002). Comparable corpora in translation research: Overview of recent analyses using the translational English corpus. In LREC Language Resources in Translation Work and Research Workshop Proceedings, pp. 5–9.

Pantopoulos, I. (2009). The stylistic identity of the metapoet: a corpus-based comparative analysis using translations of modern Greek poetry. Ph.D. thesis, The University of Edinburgh.

Popescu, M. (2011). Studying Translationese at the Character Level. In Proceedings of the 8th International Conference on Recent Advances in Natural Language Processing (RANLP’2011). Hissar, Bulgaria.

Remnick, D. (2005). The translation wars. The New Yorker, 7, 98–109.

Rocio, V., Alves, M. A., Lopes, J. G., Xavier, M. F., and Vicente, G. (2003). Automated creation of a Medieval Portuguese partial treebank. In Treebanks, pp. 211–227. Springer.

Rybicki, J. (2006). Burrowing into Translation: Character Idiolects in Henryk Sienkiewicz’s Trilogy and its Two English Translations. Literary and Linguistic Computing, 21(1), 91–103.

Rybicki, J. (2012). The great mystery of the (almost) invisible translator. Quantitative Methods in Corpus-Based Translation Studies: A Practical Guide to Descriptive Translation Research, 231.

Rybicki, J., and Heydel, M. (2013). The stylistics and stylometry of collaborative translation: Woolf’s Night and Day in Polish. Literary and Linguistic Computing, 28(4), 708–717.

Saldanha, G. (2011). Translator style: Methodological considerations. The Translator, 17(1), 25–50.

Scott, M. (1996). WordSmith tools.

Selinker, L. (1972). Interlanguage. IRAL-International Review of Applied Linguistics in Language Teaching, 10(1-4), 209–232.

Shlesinger, M., Koppel, M., Ordan, N., and Malkiel, B. (2010). Markers of translator gender: Do they really matter? Copenhagen studies in language, pp. 183–198.

Stein, A., and Prévost, S. (2013). Syntactic annotation of medieval texts. New Methods in Historical Corpora, 3, 275.

Steiner, E. (2002). Grammatical metaphor in translation – some methods for corpus-based investigations. Language and Computers, 39(1), 213–228.

Vajn, D. (2009). Two-dimensional theory of style in translations: an investigation into the style of literary translations. Ph.D. thesis, University of Birmingham.

van Dalen-Oskam, K. (2012). Names in novels: an experiment in computational stylistics. Literary and Linguistic Computing.

van Halteren, H. (2008). Source Language Markers in EUROPARL Translations. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pp. 937–944. Association for Computational Linguistics.

Venuti, L. (1995). The translator’s invisibility: A history of translation. Routledge.

Wang, Q., and Li, D. (2012). Looking for translator’s fingerprints: a corpus-based study on Chinese translations of Ulysses. Literary and Linguistic Computing.

Winters, M. (2007). F. Scott Fitzgerald’s Die Schönen und Verdammten: A corpus-based study of speech-act report verbs as a feature of translators’ style. Meta: Journal des traducteurs, 52(3).

Yule, G. U. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

Zanettin, F. (2013). Corpus Methods for Descriptive Translation Studies. Procedia-Social and Behavioral Sciences, 95, 20–32.

Zipf, G. K. (1935). The psycho-biology of language. Houghton Mifflin, Oxford.

Notes


[1] Mendenhall mentions De Morgan’s 1872 work A Budget of Paradoxes in the introduction to his 1887 paper, although he is himself unsure of the reference location.

[2] A more complete history of computational text analysis in this period is given by Hockey (2004).

[3] Michel et al. (2011) is the canonical work discussing the Google Ngrams corpus.

[4] See e.g. Candel-Mora and Vargas-Sierra (2013) and Zanettin (2013), who discuss corpus linguistic studies in translation studies literature.

[5] The term is used to refer to a dialect or variant of a language consisting solely of translations from other languages, and is to some extent related to the phenomenon of interlanguage as defined by Selinker (1972) with reference to the language acquisition context.

[6] See Burrows (2002a) for a detailed explanation of the methodology with examples.

[7] For instance villains vs. heroes, male vs. female.

[8] The data were obtained from the website

[9] Examples included readability scores and contractions such as that’s and it’s.

[10] See Altintas et al. (2007) for an example of work on temporally separated Turkish translations.

[11] Google NGrams FAQ estimates POS-tagger accuracy of 90% for 19th century text in their corpus, see http://books.

About the author(s)

Dr Gerard Lynch is the Lead Data Scientist at Popertee Ltd, which specialises in location intelligence and campaign management for the new era of engagement. Prior to joining Popertee, he held data science and natural language processing focused roles in the healthtech and media sectors. He received a PhD in Computational Linguistics from Trinity College Dublin in 2013.

