Linguistic Fingerprints and Literary Fraud

Keywords

authorship attribution, lexical statistics, literature, style, Émile Ajar, Romain Gary, La Vie devant soi

How to Cite

Tirvengadum, V. (1996). Linguistic Fingerprints and Literary Fraud. Digital Studies/le Champ Numérique, (6). DOI: http://doi.org/10.16995/dscn.187

1. Introduction

In the mid-nineteen-seventies, France witnessed one of the most elaborate hoaxes ever played on the literary scene when Romain Gary, a well-known French author, published a few novels under the name of Émile Ajar. The main reason for this subterfuge was that Gary, already an established figure in France (a war hero, a "chevalier de la légion d'honneur", and a recipient of the Prix Goncourt -- France's highest literary award), wanted to escape from the image to which critics and readers alike had pegged him, and to have his novels judged on their own merits and not on his established reputation.

His first Ajar novel, Gros-Câlin, immediately attracted the attention of critics and readers alike and became a best-seller, while new novels published under the Gary name were not as successful. When a few astute critics noticed similarities between Gary and Ajar, Gary vehemently denied having any connection with Ajar. He then persuaded his nephew, Paul Pavlowich, to impersonate Ajar. To quash any further rumour that he was Ajar, Gary even accused Ajar of plagiarising him.

In 1975, with increasing paranoia, Gary wrote a second Ajar novel, La Vie devant soi, which became an immediate success; very soon afterwards, Gary/Ajar was awarded the Prix Goncourt. He thus became the first author (and presumably the last one) to receive this award twice -- something that is strictly forbidden by the Goncourt academy.

It was after Romain Gary's suicide in 1980 that two books -- L'Homme que l'on croyait, published in 1981 by Paul Pavlowich, and the posthumous confession by Gary, Vie et mort d'Émile Ajar, also published in 1981 -- enabled readers to demystify the double disguise: Ajar was the pseudonym of Gary and not of Paul Pavlowich. Critics, then, began to notice similarities between the Gary and the Ajar novels in terms of ideas, characters, images, recurring motifs and phrasings. But so far, no one has undertaken a comparative analysis of the Gary and Ajar literary style using statistical methods.

2. Hypothesis

Nearly all experts on literary style, from Buffon to Roland Barthes, postulate that style is dictated by the subconscious and forms the "genetic" fingerprint of a writer's work. This implies, first, that it is impossible to disguise one's style and, second, that works written under a pseudonym should contain the genetic fingerprint of the writer. If Buffon's assertion that "le style c'est l'homme même" (style is the man himself) is true, the Émile Ajar corpus should prove to be statistically similar to the Romain Gary corpus.

The work of Romain Gary thus provides an excellent example for the study of authorship attribution -- which is, itself, the analysis of the stylistic traits of an author as an index of authenticity. In this paper we will deal mainly with vocabulary distribution as an element of style. First, we will look at high frequency words and, second, we will look at synonyms as style discriminants. These two methodologies are based on two well-known works: John F. Burrows' Computation into Criticism: A Study of Jane Austen's Novels and an Experiment in Method (Burrows 1987) and the study of the style of The Federalist Papers undertaken by Mosteller and Wallace (Mosteller and Wallace 1964). Burrows' approach rests on the premise that the essential element of an author's style is not confined to the rare lexical words likely to evoke love, hate or war, but lies in the forty or fifty unambiguous and most common word types in the entire corpus. In their analysis of The Federalist Papers, Mosteller and Wallace rely mainly on the use of synonyms such as while and whilst, on and upon, as style discriminants to reach their conclusions.

3. Methodology

To test whether the Gary corpus is statistically similar to the Ajar corpus, I scanned four Gary novels, two of which were published by Gallimard under the Gary name: Au-delà de cette limite votre ticket n'est plus valable (Gary 1975a) and Clair de femme (Gary 1977), and two of which were published by Le Mercure de France under the Ajar name: Gros-Câlin (Gary 1974) and La Vie devant soi (Gary 1975b). As Romain Gary's literary career spanned nearly thirty years, these four books, all written within a four-year period, were chosen to avoid problems associated with change in style over time.

After the scanning process, alphabetical concordances and word counts were established using the Oxford Concordance Program (OCP 1989) and WordCruncher (WordCruncher 1988). From these, another program sorted the words in descending order of frequency, yielding a list of the highest frequency words in the Ajar and Gary novels. In addition, because testing Ajar against Gary would not in itself have been conclusive, other twentieth-century French novels were included in the tests: Camus' L'Étranger (Camus 1942), Gide's L'Immoraliste (Gide 1902) and La Porte étroite (Gide 1904), and Mauriac's Le Noeud de vipères (Mauriac 1932); for these novels, I consulted a series of texts held in the ARTFL database (ARTFL 1997).

While it is true that the time-span for these eight novels ranges from 1902 to 1977, it is safe to assume that French syntax, grammar and usage did not change drastically (if at all) within that time frame. Moreover, these additional four novels were chosen because they are of similar length to the Gary and Ajar books, they belong to the same genre and, in all of them (as in the Gary and Ajar novels), the narrator is intradiegetic. A list of high frequency words compiled by Gunnel Engwall (Engwall 1974), made up of the most common words in French novels for the period 1962 to 1968, was also included in this study. Most of the work was done on keywords (word forms as they occur in the text) and not on lemmas.

The first part of the research focuses on the sixty most frequent words in each novel and the Engwall corpus. Table 1 gives the percentage of these words in each novel and the Engwall corpus.

The top sixty types constitute between 45.95% and 53.66% of the entire corpus. (For the identity of these words and their occurrences per thousand, see the first nine columns of Table 2.) Except for the novel La Vie devant soi, where de occurs at the relatively low frequency of 23.32 per thousand, de is the most frequent type in all the other novels and the Engwall corpus. The most frequent type in La Vie devant soi is et, which ranks about fourth in the other novels and the Engwall corpus. There are many other similar examples in Table 2. To establish whether the differences observed in the frequencies of these words are statistically significant, three statistical tests -- Student's t-test, the Pearson Correlation Coefficient, and the chi-squared test -- will be employed.
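The two quantities used above (occurrences per thousand words and the share of a text covered by its most frequent types) can be sketched as follows. This is only an illustrative sketch and not the programs used in the study (OCP and WordCruncher): the function names and the naive tokenizer, which treats elided forms such as l' as separate types, are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase the text and keep runs of
    letters, attaching a trailing apostrophe so that "l'homme" -> "l'", "homme"."""
    return re.findall(r"[^\W\d_]+'?", text.lower())


def per_thousand(tokens):
    """Map each word form to its number of occurrences per thousand tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: 1000 * count / total for word, count in counts.items()}


def top_n_coverage(tokens, n=60):
    """Percentage of all tokens accounted for by the n most frequent types."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return 100 * covered / len(tokens)
```

Applied to the full text of a novel, `top_n_coverage(tokenize(text), 60)` would give a figure comparable to the percentages reported in Table 1, provided the tokenization matches the one used to build the concordances.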

4. Student's t-test

The t-test (which is based on Student's t distribution, a curve similar to the Standard Normal Curve but suited to small samples) allows us to estimate the mean value of a population when we have no information on the standard deviation of that population. The population here is the set of all French novels having a first person narrator and published in the twentieth century, to which all the novels in this study belong.

Table 2 shows the mean value for each type in the 8 novels and the Engwall corpus under the symbol mu (for example, the mean value for de is 33.74). The t-test will also help us determine whether the values obtained might be expected to occur by chance alone. For this test, we must choose an alpha level (or, equivalently, a confidence level), which in this case is 0.01 (the 99% Confidence Interval). This means that a value belonging to the same population would fall within the two boundaries established by the 99% Confidence Interval ninety-nine percent of the time. Any value that falls outside these boundaries would have only a 1% likelihood of doing so by chance alone. As we are dealing with only 60 words in each text, very few observations (60 x 0.01 = 0.6 on average) should fall outside these boundaries by chance. Table 2 shows these values.
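As a rough sketch of how such boundaries can be obtained for one word type, the code below computes the textbook 99% confidence interval for a mean from nine per-thousand rates (one per text). The paper does not spell out the exact formula behind Table 2, so the standard-error form used here is an assumption, as are the function names. The critical value 3.355 is the two-sided 99% value of Student's t for 8 degrees of freedom, taken from a standard t table.

```python
import math
import statistics

# Two-sided 99% critical value of Student's t for 8 degrees of freedom
# (9 texts: 8 novels plus the Engwall corpus), from a standard t table.
T_CRIT_99_DF8 = 3.355


def confidence_bounds(rates, t_crit=T_CRIT_99_DF8):
    """Lower and upper bounds of the confidence interval for the mean rate.

    Assumption: textbook interval, mean +/- t * s / sqrt(n), where s is the
    sample standard deviation. The default t_crit is only valid for 9 values.
    """
    n = len(rates)
    mean = statistics.mean(rates)
    standard_error = statistics.stdev(rates) / math.sqrt(n)
    return mean - t_crit * standard_error, mean + t_crit * standard_error


def outside_bounds(rates, t_crit=T_CRIT_99_DF8):
    """Rates that fall outside the confidence bounds."""
    low, high = confidence_bounds(rates, t_crit)
    return [rate for rate in rates if rate < low or rate > high]


# Number of the 60 word types expected to fall outside by chance at alpha = 0.01.
EXPECTED_OUTSIDE_BY_CHANCE = 60 * 0.01
```

With hypothetical rates such as `[30, 31, 32, 33, 34, 35, 36, 37, 38]`, only the two extreme values fall outside the bounds.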

The last two columns in Table 2 show the lower and upper boundaries established by the 99% Confidence Interval. All the novels and the Engwall corpus have observations falling outside these boundaries, as seen in Table 3.

At the 99% Confidence Interval, most of the novels have between 6.67% and 30% of their observations falling outside the boundaries, whereas La Vie devant soi has more than 50% of the observations falling outside the boundaries. This establishes nothing conclusive, except to show that La Vie devant soi is quite different from the other novels in this study and has to be investigated further. A statistical test that is quite useful in this case is the Pearson Correlation Coefficient.

5. Pearson Correlation

The Pearson Correlation is a precise measure of the way in which two variables correlate. Its value indicates both the direction (positive or negative) and the strength of the correlation between two variables. The value +1 indicates a perfect positive correlation and the value -1 a perfect negative correlation, whereas a value of 0 indicates no correlation at all. Any value between 0 and +1, or between -1 and 0, shows some degree of correlation. A correlation of 0.75340 indicates that approximately 56.76% of the variation in one variable is accounted for by the other (0.75340 squared is 0.5676): the Correlation Coefficient is the square root of the proportion of variation explained in a linear regression. Table 4 shows the correlation among the novels and the Engwall corpus when the 60 most frequent words are compared to one another.
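For readers who want to reproduce this kind of comparison, the coefficient can be computed from two lists of per-thousand rates (one value per word type, in the same order for both texts). The sketch below uses the standard definition; the function name is an assumption and the code is not the software used in the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def variation_explained(r):
    """Proportion of variation explained: the square of the coefficient."""
    return r ** 2
```

Here `variation_explained(0.7534)` gives about 0.5676, which is the 56.76% mentioned above.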

It is not surprising that correlation between all the books is high and significant for, as has already been pointed out by Burrows, the shape of the English language makes it impossible for certain word-types like the (le, la, l' and les in French) and of (de in French) to slip towards the bottom of the list, to be replaced at the top with words like more (plus in French; Burrows 1989: 313). Burrows' observation certainly applies to the French language as well.

However, the highest correlation occurs when books written by the same author are compared. For example, L'Immoraliste and La Porte étroite (both written by Gide) correlate the highest with each other, at 0.9563. Clair de femme and Au-delà de cette limite votre ticket n'est plus valable -- both published under the Gary name -- have a high correlation, of 0.9542. High correlation also occurs between these two Gary novels and Gros-Câlin (the first Ajar novel), but not with La Vie devant soi. La Vie devant soi correlates the highest with Gros-Câlin (0.8204), followed closely by L'Étranger (with a correlation of 0.8204). La Vie devant soi shows a lower correlation with all the other novels, ranging between 0.66 (with Engwall) and 0.82 (with Gros-Câlin). Its correlation with the two Gary novels is respectively 0.78 and 0.80 (while the correlation between the two Gary novels, as noted above, is 0.95). This suggests that La Vie devant soi is quite different from the other Gary novels.

When the list of sixty words is condensed to a context-free list (i.e. when all person markers, pronouns, verbs, etc., are removed from the list) the correlation, as seen in Table 5, is obtained.

We observe once more that highest correlation occurs between books written by the same author: L'Immoraliste and La Porte étroite have a high correlation of 0.9682; Clair de femme and Au-delà de cette limite votre ticket n'est plus valable show a correlation of 0.9760. Gros-Câlin and La Vie devant soi, however, show a correlation of 0.8739. In fact, there is a stronger correlation between La Vie devant soi and L'Étranger, at 0.87475, than between La Vie devant soi and the other Gary novels.

The Engwall corpus shows the lowest correlation of all. This is probably because this corpus comprises a selection of passages from 25 novels and is not confined to intradiegetic novels. It is interesting to note, however, that it correlates the highest with L'Étranger (0.6626), followed by La Vie devant soi (0.6152). The Pearson correlation tests of relationship between the novels and the Engwall corpus show that La Vie devant soi is different from the Gary novels as well as all the other novels.

The next test that we will use in this paper to determine the degree of similarity between the novels and the Engwall corpus is the chi-squared test.

6. Chi-squared Test (chi2)

The chi-squared test is a test of probability whose function is to establish whether a discrepancy of a given size is too large to be dismissed as having occurred by chance alone. It rests on testing a "null hypothesis", in which the actual result is compared with the expected result. The hypothesis is upheld when the expectation is satisfied. The formula for this test is as follows: chi2 = Σ (O-E)^2/E, where the letter O signifies an observed value, the letter E signifies the corresponding expected value, and the sum runs over all the values compared.

The null hypothesis in this case is that the high frequency words should be similarly distributed among the four Gary novels. One should note that there are a few restrictions governing the use of this test -- firstly, that no expected value may be below 5 and, secondly, that this test cannot be applied to relative frequencies, which constitute a proportion. The test is therefore applied to absolute values only. Table 6 gives chi-squared values for each type in each novel and the Engwall corpus.
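The computation described above can be sketched for a contingency table of absolute counts, with one row per word type and one column per text. The layout of the table and the function names are assumptions, since the paper does not show how its table was arranged; expected values are derived in the usual way from the row and column totals, and the restriction on expected values below 5 is enforced explicitly.

```python
def chi_squared_cells(table):
    """Per-cell chi-squared contributions (O - E)^2 / E for a contingency table.

    `table` is a list of rows of absolute counts (rows: word types, columns:
    texts). Expected values come from the row and column totals.
    """
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    contributions = []
    for i, row in enumerate(table):
        row_contributions = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / grand_total
            if expected < 5:
                raise ValueError("expected value below 5: test not applicable")
            row_contributions.append((observed - expected) ** 2 / expected)
        contributions.append(row_contributions)
    return contributions


def chi_squared_total(table):
    """Total chi-squared value: the sum of all per-cell contributions."""
    return sum(sum(row) for row in chi_squared_cells(table))
```

Summing the contributions in one column gives a total for the corresponding text, which is how a per-novel figure could be derived from such a table.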

To interpret these results, one must note that any chi-squared value that is less than 3.84 is dismissed as attributable to chance. Any value that falls between 3.84 and 6.62 is somewhat significant, as it has only a 5% likelihood of having occurred by chance alone. Values that fall between 6.63 and 10.82 are significant, having a 1% likelihood of having occurred by chance alone. Values of 10.83 and above are very highly significant, having a likelihood of one in a thousand of having occurred by chance alone. (All significant values are shown in bold letters in the table.)
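These cut-off points are the standard critical values of the chi-squared distribution for one degree of freedom. A small helper, given here only as an illustration (the function name and the labels it returns are assumptions), maps a chi-squared value to the significance band described above.

```python
def significance_band(chi2):
    """Classify a chi-squared value (1 degree of freedom) by the usual cut-offs."""
    if chi2 < 3.84:
        return "not significant"
    if chi2 < 6.63:
        return "significant at the 5% level"
    if chi2 < 10.83:
        return "significant at the 1% level"
    return "significant at the 0.1% level"
```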

Ticket has 32 significant chi2 values at the 0.01 level, Clair has 30, Câlin 44, L'Étranger 39, L'Immoraliste 35, La Porte étroite 44, Vipères 43, Engwall 48 and La Vie devant soi has 49 significant chi2 values. Total chi-squared values range between 908.82 (Ticket) and 5159.77 (La Vie devant soi). When context-sensitive words are removed from the list, the global chi-squared values are as follows: Ticket 245.54; Clair 355.63; Câlin 666.91; Vie 2359.85; L'Étranger 388.64; Vipères 684.37; L'Immoraliste 525.1; Porte 1025.9; Engwall 798.22. Once more, La Vie devant soi presents the highest chi2 value.

When the Engwall corpus is removed from the group, thus allowing comparisons between the 8 novels alone, the following chi2 values are obtained: Ticket 771.52; Clair 979.73; Câlin 1281.3; Vie 3867.26; L'Étranger 2628.26; Vipères 1698.44; L'Immoraliste 1349.88; Porte 2157.64. Once more, La Vie devant soi presents the highest value. When context-sensitive words are removed from the list, the chi2 results are as follows: Clair 267.78; L'Étranger 327.19; Ticket 327.41; L'Immoraliste 459.99; Câlin 528.49; Vipères 693.84; Porte 809.99; Vie 1675.6. The highest value is again that of La Vie devant soi.

When we apply the "goodness of fit" test to the data, the theoretical values can now be represented by the two Gary novels Clair and Ticket. By applying the chi-squared formula (O-E)^2/E, we get the results shown in Table 7.

In this table, Ex. Val. means the expected value (i.e. the average value for the two Gary novels). Again, the total chi-squared values are all very high. However, Gros-Câlin, the first Ajar novel, corresponds the most with the two Gary novels, whereas La Vie devant soi and L'Étranger correspond the least with the two Gary novels. In fact, there is more similarity between the two Gary novels and L'Immoraliste, La Porte étroite, Le Noeud de vipères and the Engwall corpus than between the two Gary novels and La Vie devant soi. We also observe similarities between L'Étranger and La Vie devant soi.
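The goodness-of-fit variant can be sketched in the same way, taking as the expected value for each word the average of its counts in the two reference novels. The function name is an assumption, and so is the premise that all counts are expressed on a common basis (for instance, scaled to texts of the same length); the paper does not describe how differences in text length were handled.

```python
def goodness_of_fit(observed, reference_a, reference_b):
    """Total chi-squared value of `observed` against two reference texts.

    Each argument maps a word type to its count. The expected value for a
    word is the average of its counts in the two reference texts.
    Assumption: all counts are on a common basis (same text length).
    """
    total = 0.0
    for word, observed_count in observed.items():
        expected = (reference_a[word] + reference_b[word]) / 2
        total += (observed_count - expected) ** 2 / expected
    return total
```

A lower total indicates a text whose word frequencies lie closer to those of the reference pair, which is the sense in which Gros-Câlin "corresponds the most" with the two Gary novels.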

When this list is condensed to a context-free list, we get the following chi-squared results: Câlin 414, Vie 1084, L'Étranger 460, Vipères 500, L'Immoraliste 580, Porte 940, Engwall 300. Once more, among the Ajar novels Gros-Câlin is the most like the two Gary novels, whereas La Vie devant soi shows the least similarity with these two novels.

Most of the chi-squared tests attempted so far show that La Vie devant soi displays the least similarity with the control group, be it represented by Engwall or the two Gary novels Ticket and Clair de femme.

7. Synonyms

The second part of this study focuses on the use of synonyms in the corpus as a style discriminant. The groups of paired-words chosen for this purpose, as well as their frequency in the corpus, are indicated in Table 8.

Using the chi-squared test on these pairs of synonyms, I shall test the degree to which the occurrence of these paired-words differs from one novel to another. The chi-squared results are shown in Table 9.

Chi-squared values for each individual word are indicated in the table. Those that are significant are shown in bold letters. The last row shows total chi-squared values for each novel and the Engwall corpus, ranging from 52.36 (Clair) to 746.42 (La Vie devant soi), indicating that the difference between the frequency of these synonyms is smaller in Clair de femme and larger in La Vie devant soi.

When a chi-squared test is done on the four Gary novels (using the contingency table), the following results are obtained. The least significant value is indicated by Gros-Câlin (76.75), followed by Ticket (104.11) and Clair (121.61). The most significant value is indicated by La Vie devant soi (255.34). When the three novels -- Clair, Ticket and Câlin -- are compared, the following chi-squared values are obtained: 44.14 (Clair), 33.61 (Ticket) and 31.45 (Câlin). Total chi-squared value for the three novels is 122.2, indicating once more that the paired words do not occur at the same frequency in the three novels.

When the same test is done on Clair, Ticket and La Vie, the following chi-squared values are obtained: 155.9 (Clair), 128.69 (Ticket) and 185.61 (La Vie). The total chi-squared value for all three novels is 470.20, which means that there is a greater difference between the frequency at which these paired-words occur in the novels La Vie and the two Gary novels, taken together, than between Gros-Câlin and the two Gary novels taken together; in turn, this indicates that La Vie devant soi stands apart from the other three Gary novels.

8. Conclusions

The statistical tests done in this paper point to the same conclusion: La Vie devant soi is significantly different from the other Gary novels, as well as the other novels in the test. They also suggest that high frequency words and pairs of synonyms, which are considered by many experts on style to constitute the unconscious elements of an author's style, can indeed be consciously manipulated by the author. The notion that function words (and synonyms) constitute a genetic fingerprint of an author's style is, therefore, disputed by the case of Romain Gary / Émile Ajar.

While Gros-Câlin, the first Ajar novel, closely resembles the two Gary novels, La Vie devant soi is so significantly different from the two Gary novels that it could have been written by another author. It would appear that Gary did not feel the need to change his style drastically in Gros-Câlin, his first Ajar novel, feeling confident that nobody would make the connection between him and Ajar. But, when critics saw similarities between Gros-Câlin and the Gary novels, he became increasingly paranoid and wrote La Vie devant soi determined to prove that he was not Ajar. In so doing, he consciously or unconsciously changed the genetic fingerprint of the Gary style in that novel.

In this way, the findings in this paper question the notion that function words (and synonyms) constitute a genetic fingerprint of an author's style.

Bibliography

  • ARTFL (1997). American and French Research on the Treasury of the French Language, Chicago: U of Chicago. <URL: http://humanities.uchicago.edu/ARTFL/ARTFL.html>
  • BURROWS, J.F. (1987). Computation into Criticism: A Study of Jane Austen's Novels and an Experiment in Method, Oxford: Clarendon P.
  • BURROWS, J.F. (1989). "'An Ocean Where Each Kind...': Statistical Analysis and Some Major Determinants of Literary Style", Computers and the Humanities, 23: 309-21.
  • CAMUS, Albert (1942). L'Étranger, Paris: Gallimard.
  • ENGWALL, Gunnel (1974). Fréquence et distribution du vocabulaire dans un choix de romans français, Stockholm: Skriptor.
  • GARY, Romain (1974). Gros-Câlin, Paris: Mercure de France.
  • GARY, Romain (1975a). Au-delà de cette limite votre ticket n'est plus valable, Paris: Gallimard.
  • GARY, Romain (1975b). La Vie devant soi, Paris: Mercure de France.
  • GARY, Romain (1977). Clair de femme, Paris: Gallimard.
  • GARY, Romain (1981). Vie et mort d'Émile Ajar, Paris: Gallimard.
  • GIDE, André (1902). L'Immoraliste, Paris: Mercure de France.
  • GIDE, André (1904). La Porte étroite, Paris: Mercure de France.
  • MAURIAC, François (1932). Le Noeud de vipères, Paris: Mercure de France.
  • MOSTELLER, F. & D.L. WALLACE (1964). Inference and Disputed Authorship: The Federalist, Reading, MA: Addison-Wesley.
  • OCP (1989). Oxford Concordance Program, Oxford Electronic Publishing, Oxford UP [See HOCKEY, Susan (1989). Micro-OCP User Manual, Oxford: Oxford UP].
  • PAVLOWICH, P. (1981). L'Homme que l'on croyait, Paris: Fayard.
  • WordCruncher (1988). WordCruncher, Orem, UT: Electronic Text Corporation.

Authors

Vina Tirvengadum (University of Manitoba)

Licence

Creative Commons Attribution 4.0
