Literature scholars wanting online corpora or databases have a very rich range of choices. It is necessary at the outset to distinguish types of corpora. There is the sample corpus, such as the Brown Corpus (Kučera and Francis, 1967), or the LOB Corpus (Johansson, 1978), maintained by ICAME in Bergen (Hofland and Johansson, 1991), or Engwall's (1984) Vocabulaire du roman français. These corpora consist of a number of samples of equal size; they are designed to constitute a statistically reliable sample of a particular aspect of a given language. Even when such a corpus corresponds to the interests of a literature scholar, the fact that it is based on samples limits its usefulness. It does provide a statistical baseline for evaluating the frequency of given linguistic phenomena, but it is hard to imagine writing a literary interpretation based on, for example, a 2,000-word sample of a text.
Full-text databases come in several varieties. Where a manageable number of texts has survived into the present time, an exhaustive database encompassing the entire surviving record is possible, as with the Thesaurus Linguae Graecae or the Old English database. In most cases, however, the potential texts far outnumber the capacity to record, verify and maintain the data. Data in such languages are represented in two types of full-text databases. There is the opportunistic collection: a generous person or group conserves and makes available for further study texts obtained from other sources or texts which have been recorded for a given purpose and are subsequently being made available for other work; the Oxford Text Archive, the Collins Bank of English, and the B.A.S.I.L.E. corpus in French come to mind as examples of this.
Finally there is the more focussed representative full-text database, such as one of the many Chadwyck-Healey English Literature databases or the Trésor de la Langue Française. At the same time as they claim to give a reliable picture of the subject matter, they facilitate analysis more congenial to the habits of those of us who study literature. Such corpora give researchers the opportunity to study the text in its entirety. Therefore, one can fully analyze a text and its themes without having to assume what is in the rest of the work, as would be the case when working from samples. In addition, one can find word frequencies in each of the included texts, as well as the context of the searched word or phrase. In other words, advanced types of content analysis are facilitated greatly by a corpus such as the Trésor de la Langue Française (TLF). Of course, the validity of the choice of texts is crucial to the value of such representative full-text databases.
Questions can be raised about the choice of texts for inclusion in the Trésor de la Langue Française (TLF) corpus. This corpus was set up almost half a century ago. In November 1957, a conference of experts in the French language, held at the Université de Strasbourg, proposed the development of a new historical dictionary of the French language. This dictionary was to be based on a computer-generated database of significant, primarily literary, texts covering the period from 1789 to what was then the present. On approval of the project, research personnel drew up a list of texts on the basis of consecrated histories of literature, including one by the very old-fashioned Gustave Lanson (Imbs, 1971, I, xxiii). This list was revised by a team of professors who both added and deleted titles. After further minor revisions by the project steering committee, the list was approved, and the long process of converting the texts to machine-readable form began. The Préface of the TLF Dictionnaire (Imbs, 1971, I, xlix-xciii) provides a complete list of the titles.
This process reveals the social perspective underlying the creation of the dictionary. Indeed, in his Préface, Professor Imbs makes it clear that it is written for university professors, senior administrators in the private and public sectors, and other highly cultured people, in short what he does not blush to call the “élite” (Imbs, 1971, I, xviii) of French society. It is no surprise then that the author with the highest number of texts in the Trésor database is Paul Claudel: wealthy landowner in the Champagne region, senior career civil servant and ambassador, and proponent of mystical Catholicism. Considerations of inclusiveness, of representativeness, of the type discussed in Scholes (1992) or von Hallberg (1984), do not seem to have concerned the committee which finalized the corpus. One is entitled to wonder to what extent this corpus represents the interests of scholars of French literature a half century later.
It is legitimate to evaluate the extent to which the texts included in the TLF database do represent important trends in French literature. Considerations of practicality have their place in any testing of the representativeness of the TLF corpus, because a method appropriate to the evaluation of this corpus would also be useful in the evaluation of other corpora now available.
Opinions vary concerning what is or is not important in any body of literature, and the claim has been made that consecrated works, the literary canon, are the result of an arbitrary selection process of questionable validity. Thus, any individual opinion of what is important in literature is open to charges that it has been influenced by genre, social class, fad, or personal bias, including gender and age.
A reasonable measure of the importance of a given author or text is the number of publications which scholars of literature choose to produce dealing with one or the other. This information is available online for the period between 1963 and the present. Similar information is not available online for the years during which the TLF corpus was being defined. The Oxford Companion to French Literature (Harvey and Heseltine, 1959), like the TLF database, resulted from collective choices. Work began in 1934 and a first version was ready in manuscript form in 1947; the manuscript was revised thoroughly by professors of French literature teaching at various British universities, and the final work was published in 1959. Thus the Oxford Companion is a reflection of scholarship in the years leading up to the 1960s, much the same as the TLF corpus represented choices informed by the research of a similar period. It is thus possible to see whether the choices embodied in the TLF reflect what scholars of the time judged important, by comparing the choices of authors and of texts in a given genre -- the novel -- to the number of lines dedicated to them found in the Oxford Companion to French Literature.
Similarly, the MLA (Modern Language Association) Bibliography (www.mla.org/publications/bibliography) provides online data showing the number of publications in the modern languages and literatures for the periods 1963-90 and 1991 to the present. A comparison between the number of publications devoted to a novelist found in this bibliography and the number of texts by the same novelist in the TLF will show the extent to which choices made by the TLF group have been confirmed by the interest of later scholars. Given the volume of data involved these questions must be dealt with using statistics.
A subset of the TLF database was chosen for analysis: novels published between 1789 and 1954 (see Table 1: Novel Data in the Trésor de la Langue Française). The name of the novelist (Author) and the number of novel texts included in the database for each writer (Texts) were recorded, along with the publication date of the text included in the database (Pub_Date). When more than one novel by a given author is in the TLF, the second column (Pub_Date) records the date of the earliest one published. When novelists were also known as playwrights or as poets, or for producing works in all three genres, they were not included in the counts of texts, because it was not practical -- and in many cases not even possible -- to sort out what proportion of scholarly interest arose from which genre. These multi-genre authors represent a source of noise in the data. Consequently, authors who would be a source of ambiguity, such as Sartre, Camus, Hugo, and Cocteau, were removed from the test data. No account was taken of whether works by a given novelist were included in the section of the TLF database dedicated to non-literary writing. The final set of test data contained a total of one hundred and twenty-four novelists whose first text in the TLF database was published between 1791 and 1954. This corresponds to a total of two hundred and ninety-one texts, ranging in number from twenty-two texts for Balzac to a single text for more than half the authors recorded. Thus, although the mean number of texts was 2.36, the median was 1. In order to facilitate comparative analysis of the various chronological periods represented in the TLF database, the novel data were also divided chronologically into four periods with similar numbers of authors in each. The four periods, with the number of novelists included in each, are as follows: 1789-1859 (32), 1860-1907 (34), 1908-26 (25), and 1927-54 (33).
These numbers were compared to three series of test data. The column OxC in Table 1 records the number of lines devoted to the novelist and to the included novels by that author which are found in the Oxford Companion to French Literature, an encyclopaedia of French literature developed before the elaboration of the TLF database, but whose date of publication is contemporary with the formation of the TLF database. It provides a quantification of the importance of the authors and texts to literature scholars at the time of the inception of the database. The total number of lines devoted to the one hundred and twenty-four TLF pure novelists and their works was 5204. The size of discussions ranges from a maximum of 577 lines for Balzac to zero for approximately thirty mainly 20th-century novelists. The arithmetic mean length of discussions is forty-two lines and the median is fifteen lines.
Published annually since the early years of the 20th century, and compiled by a large team of international scholars, the MLA online Bibliography provides access to academic research in nearly four thousand journals and series. It includes pertinent monographs, festschriften, conference proceedings, learned journal articles, and other formats. Its originally North American orientation has, in the later years of the 20th century, given way to a more cosmopolitan approach. It is now the primary source of bibliographic information relating to studies of literature, language, linguistics, and folklore. The online MLA international bibliography includes coverage from 1963 to the present. This bibliography provides information reflecting current interest in the texts. For greater precision, the MLA bibliographic data come divided into the periods 1963 to 1990, and 1991 to the present. The distinction between the two periods in the bibliography was retained in order to preserve a diachronic dimension in the data.
The Columns MLA_1 and MLA_2 in Table 1 record the number of publications mentioning the novelist or work(s) found in the MLA online bibliography. MLA_1 covers the period 1963-1990 and MLA_2, 1991-2000. In each of the two sections of the bibliography, the method used to determine the importance of authors and their works was to count the number of entries relating to them. This was done by searching in the "keyword" and the "title" fields for the author or the text. It was taken for granted that larger numbers of entries indicate more influential authors and works.
In the period 1963-1990 (MLA_1) there were 20,280 publications dealing with the TLF novelists. Marcel Proust was the novelist who attracted the most attention with 2066 publications, and only five novelists included in the TLF database attracted no publications. A typical novelist was the subject of 164.8 publications if the arithmetic mean is taken to be typical; if the median represents typicality, then the number is thirty-six publications. For the period 1991 to the present (MLA_2), Zola with 829 publications had the highest number out of the 9423 publications dealing with the TLF novelists. About twenty novelists of the one hundred and twenty-four sampled had no publications concerning them or their works. The mean number of publications for this period was seventy-six publications and the median was nine publications.
Table 1 shows the first seventeen pure novelists, when presented in alphabetical order. Since the median count for the MLA_1 is four times as large as for MLA_2, it can be seen from Table 1 that Aymé, Balzac, Barbusse, and Bernanos have roughly the same amount of scholarly interest in both periods. Arland, Alain-Fournier, and Hervé Bazin suffer a decline in interest. Paul Adam, Baillon, and Simone de Beauvoir, on the other hand, benefit from a growing interest by scholars of literature. Taking all four columns of data into account for the one hundred and twenty-four novelists manifestly requires a systematic statistical analysis rather than this type of impressionistic commentary.
Figure 1 (Number of Novels in the TLF by Author) summarizes the number of novels included in the TLF database for each author being studied. It confirms the impression already created by Table 1. The number of texts recorded for individual authors falls into a pattern quite familiar to people who work with word frequencies in natural languages. There is a large number of very low frequencies and a very small number of quite high frequencies. In fact, a similar pattern can also be identified in the data from the Oxford Companion to French Literature and from the two MLA online bibliographies. In none of these four sets of data is the arithmetic mean at all close to the median. The data, quite clearly, do not form the familiar bell-shaped curve typical of the Gaussian distribution. The observation that they are not in a Gaussian distribution is confirmed by the fact that the mean and the median are substantially different in the observed sets of data; in a Gaussian distribution, the mean and the median are expected to be identical. Pearson's product-moment correlation analysis, which requires data in a Gaussian distribution, can therefore not legitimately be used on them.
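The gap between mean and median described above can be checked with a few lines of code. The following is a minimal sketch, assuming invented counts that merely imitate the shape of the Texts column (many authors with one text, one author with twenty-two); the list `texts` and the helper `skew_summary` are hypothetical and do not reproduce Table 1.

```python
from statistics import mean, median


def skew_summary(values):
    """Return (mean, median, mean/median) for a list of counts."""
    m = mean(values)
    med = median(values)
    # A ratio far from 1 signals a strongly skewed distribution.
    ratio = m / med if med else float("inf")
    return m, med, ratio


# Hypothetical counts: most authors have one text, one has twenty-two.
texts = [1, 1, 1, 1, 1, 1, 2, 2, 3, 22]

m, med, ratio = skew_summary(texts)
print(f"mean={m:.2f} median={med} mean/median={ratio:.2f}")
```

A single large value is enough to pull the mean well above the median, which is the pattern reported for all four columns of data.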
Similarly, these data would produce a very high proportion of predicted values smaller than five in a contingency table for a χ² (chi-squared) analysis, so this method cannot be employed either. The usual way of handling such a problem when performing a χ² analysis (grouping the data) is not appropriate, since it is the treatment of individual authors which is of interest, and grouping the data would destroy access to this information.
However, Spearman's rank correlation analysis avoids these problems, since it does not require that data conform to the Gaussian distribution, nor does it require predicted frequencies greater than five in a contingency table. Therefore, Spearman's technique has been chosen as the primary analytic technique and applied in pairwise fashion to the data, and to the four chronological subsets of the data. The software (JMP-IN, Sall and Lehman, 1996) also provides the statistics of the probability that the correlations result from chance alone.
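For readers who wish to see what the rank correlation involves, here is a minimal sketch in plain Python. It is not the JMP-IN computation: it simply replaces each value by its rank (tied values share the mean of their ranks, which matters here because so many authors have a single text) and then computes the ordinary correlation coefficient on the ranks. The data are invented, the function names are hypothetical, and the probability estimate that JMP-IN reports is not computed.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's coefficient: Pearson's coefficient applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for five authors (not taken from Table 1).
texts = [1, 1, 2, 3, 22]
mla_1 = [0, 5, 40, 36, 900]

rho = spearman(texts, mla_1)
print(f"Spearman coefficient: {rho:.4f}")
```

Because only the ranks enter the calculation, the extreme values (22 texts, 900 publications) carry no more weight than any other top-ranked value, which is why the technique tolerates non-Gaussian data.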
At the same time, jackknifed outlier analysis by JMP-IN (Sall and Lehman, 1996) has been used to present visually, on a two-dimensional graph, authors whose distribution varies the most from the trends in the data. The outlier analysis, a multi-variate statistical technique which shows the extent to which a given observation is similar to or different from the other observations, requires some explanation.
When one has a set of variables, each one of which has multiple values attached to it, as is the case with the data being analyzed here (e.g. Texts, OxC, MLA_1, MLA_2), it is possible to determine the mean and the standard deviation of each class of values (e.g. Texts). Plotting the name of the variables along the x axis and the mean on the y axis, one would produce a horizontal straight line; individual values would be points above, at, or below this line depending on their relationship to the mean. It is even possible to shift the points below the line (because they were smaller than the mean) to an equivalent distance above the line by removing their minus sign when there is one; this simplifies the chart since it is the distance from the mean (distance measured in standard deviations) for each value which is of interest.
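The one-dimensional step just described (distance from the mean, measured in standard deviations, with the minus sign removed) can be sketched as follows. The values are invented, the function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is meant.

```python
from statistics import mean, stdev


def abs_z_scores(values):
    """Distance of each value from the mean, in standard deviations, sign dropped."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (an assumption)
    return [abs(v - centre) / spread for v in values]


# Hypothetical line counts for six authors (not taken from Table 1).
oxc_lines = [0, 8, 15, 15, 42, 577]

distances = abs_z_scores(oxc_lines)
for value, dist in zip(oxc_lines, distances):
    print(f"{value:4d} lines -> {dist:.2f} standard deviations from the mean")
```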
A similar process could be carried out with the second set of values, to produce a line in three-dimensional space, with the points situated in that space according to the distance from the baseline, which would represent the mean of both the sets of values. Although impossible for most people to visualize, it is mathematically possible for the process to be repeated for as many dimensions as there are sets of values for the variables. Similarly, pairwise correlations among the values of the variables can be taken into account in n-dimensional space. Once this process is complete, further mathematical transformations can reduce these distances in n-dimensional space to a single number for each observation. This number is called the Mahalanobis distance. It corresponds to the amount that one observation differs from the mean along all the dimensions defined by the different sets of values associated with it. It has the advantage of being representable in two dimensions (on a graph, with the observations along one axis and the distance along the other), which is easily interpretable. This is because a Mahalanobis distance of zero indicates that the individual observation corresponds exactly to the mean, in all dimensions, whereas the distance increases with the variation from the overall mean of all the values taken together.
A further refinement is possible in the computation of the mean. One could take all of the values observed for a given class, and determine their mean. Or, one could exclude from the values being included in the computation of the mean for a given variable, the value under consideration and compute its deviation from a mean determined by the other values. The second manner of computing requires a new calculation of the mean for each variable (in the case of the TLF novel data that produces one hundred and twenty-four different calculations of the mean, for each set of values) but has the advantage of preventing extremely extraordinary values from affecting the mean. The latter technique, called a jackknifed approach, has the advantage of making outliers stand out more clearly. It is the one that has been chosen for the outlier analysis of the TLF Novel data.
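The difference between the ordinary and the jackknifed distance can be illustrated with a short sketch in plain Python. It is offered under stated assumptions, not as a reconstruction of JMP-IN: the data are invented pairs of (Texts, OxC) values, all function names are hypothetical, and the jackknifed version re-estimates both the mean and the covariance from the remaining observations, whereas the description above mentions only the mean.

```python
import math


def mean_vector(rows):
    """Column means of a list of equal-length observations."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1)."""
    n = len(rows)
    mu = mean_vector(rows)
    d = len(mu)
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mahalanobis(point, rows):
    """Distance of point from the mean of rows, scaled by their covariance."""
    mu = mean_vector(rows)
    inv = invert(covariance_matrix(rows))
    diff = [p - m for p, m in zip(point, mu)]
    d = len(diff)
    q = sum(diff[i] * inv[i][j] * diff[j] for i in range(d) for j in range(d))
    return math.sqrt(q)


def jackknifed_distances(rows):
    """Distance of each observation from statistics computed without it."""
    return [mahalanobis(rows[i], rows[:i] + rows[i + 1:])
            for i in range(len(rows))]


# Hypothetical (Texts, OxC) pairs; the last one imitates an extreme author.
rows = [(1, 10), (2, 25), (1, 15), (3, 30), (2, 18), (22, 577)]

plain = [mahalanobis(r, rows) for r in rows]
jack = jackknifed_distances(rows)
for r, p, j in zip(rows, plain, jack):
    print(f"{r}: ordinary={p:.2f} jackknifed={j:.2f}")
```

In this toy example the extreme observation inflates the mean and covariance when it is included, so its ordinary distance stays modest; once it is left out of the calculation, its jackknifed distance becomes far larger, which is the sense in which the jackknifed approach makes outliers stand out.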
The data were analysed as a single unit of one hundred and twenty-four observations, and also broken into four chronological sections representing roughly equal-sized sets of data, in terms of the number of authors included. On an experimental basis, an expanded database was also analysed. In order to take into account the possibility of egregious omissions by the TLF Committee in choosing texts, all pure novelists recorded in the Oxford Companion to French Literature as publishing between 1789 and 1960 were added to the dataset. This addition increased the number of authors by fifty-four, without, of course, increasing the number of texts. The number of lines from the Oxford Companion to French Literature went up by 776, the number of publications found in MLA_1 increased by 510, and for MLA_2 the increase was 307.
Table 2 (Spearman Rank Correlation Coefficients on TLF Novel Data) presents the results of the Spearman's rank correlation analysis of the full set of one hundred and twenty-four observations representing the pure novelists included in the TLF database; it also includes the correlation coefficients and probability estimates for the four sub-sections into which these data were divided. The leftmost set of correlation coefficients and probabilities corresponds to the original data based on the TLF; the rightmost set was produced by the data extended so that authors in the Oxford Companion to French Literature but not in the TLF are also analyzed.
For both sets of complete data the correlation coefficients among the four values recorded for each novelist are so great that the possibility of these findings being the result of chance is less than one in ten thousand. In other words, given the number of observations analyzed, the difference between a coefficient of 0.3289 and one of 0.5653 is not significant. All four variables are so highly correlated that there is no measurable chance of this being a random event.
Looking at the relationship between the correlation coefficients, when the data are broken into chronological periods, does not produce the same certainty. The relation between Texts and OxC improves in one period (1791-1859) with the addition of more data, stays roughly the same for 1860-1907, and becomes less significant in the final two periods (1908-1926, 1927-54). The relationship between Texts on the one hand and MLA_1 or MLA_2 on the other, is different but no less inconsistent; the addition of more data does not systematically affect the results, and seems to be simply a random factor. For this reason, only the results for the original data (in the left-hand columns) will be discussed.
Decreasing the number of variables used to make up a correlation coefficient has the automatic effect of increasing the risk that the coefficients be the result of chance. Even taking that into account, it can be seen that the correlations are all significant at the 0.05 level (although only barely in the case of the variables MLA_1 and Texts for the period 1791-1859). The consistency of the results is much better for the middle years represented in the database (1860-1926), for which one finds all values significant at least at the 0.01 level. The most recent period (1927-54) shows a complete breakdown in reliability, with the correlation between MLA_2 and Texts having equal probability of being significant or the result of chance. This corresponds to the well known difficulty of choosing literarily significant texts before the passage of time has separated significant texts from those which are popular because of a passing fad.
Outlier analysis, which highlights those authors whose distribution patterns are at the greatest variance with the mean, offers an opportunity for better understanding of the data. It will be recalled that the data tend to have a large number of small values. Whether it be one or two texts in the TLF data base, or the number of lines in the Oxford Companion, or number of publications recorded by the MLA bibliographies, only a few novelists are notable for their high values. Thus it is clear that the authors who are statistical outliers should be the important ones.
Figure 2 (Outlier Analysis of TLF Novels, 1791-1954 (n = 124)) shows the outliers when the entire TLF novel database is analysed for the period 1791-1954. The unquestionably important authors -- Balzac, Stendhal, Flaubert, Zola, and Proust -- are joined by less important but still significant authors like Maupassant, Bernanos, and Malraux. Two of the three women authors, Sand and Beauvoir, are notable because they have large numbers of publications recorded in the MLA bibliography, without a large number of texts in the TLF database. With Colette, the numbers are more in harmony and are high across the board. Martin du Gard is the sole example of the opposite phenomenon; his twelve novels in the TLF database correspond with a small and shrinking number of publications recorded in the MLA bibliography.
Sand, a woman author for whom only one novel was included in the TLF database but who had high numbers of lines in the Oxford Companion and a large number of publications recorded in the MLA bibliography is the most notable outlier in Figure 3 (Outlier Analysis of TLF Novels, 1791-1859 (n=32)). Clearly, interest in this author has increased markedly since the 1950s. Three other outliers (Balzac, Stendhal, Flaubert) are greats of the French novel and have higher than usual values across the board. Nodier is notable in that six titles by him are in the TLF novel collection, and the numbers in the other three categories are not commensurate with this importance.
Figure 4 (Outlier Analysis of TLF Novels, 1860-1907 (n=34)) highlights novelists already seen in the outlier analysis of the period 1791-1954: Zola, Maupassant, and Colette, two male and one female novelist. To them are added Gide and France. The former has high values in all four categories, whereas France has declining values with time, a reflection of the fact that this author has gone out of style.
Proust is the most notable outlier in Figure 5 (Outlier Analysis of TLF Novels, 1908-1926 (n=25)), and the most important novelist of this period. To Martin du Gard, an author whose reputation is declining, as already seen in the analysis of the TLF novel dataset for 1791-1954, is now added Alain-Fournier, a second author whose importance is waning. Similarly, Mauriac, a right-wing Catholic and Gaullist author, is added to Bernanos, who has a similar perspective, and was already seen in the analysis of the TLF dataset for the period 1791-1954.
Beauvoir and Malraux are the only authors from the final chronological period (See Figure 6: Outlier Analysis of TLF Novels, 1927-1954 (n=33)) who appeared in the analysis of the entire dataset. Three other authors appearing here -- Saint-Exupéry, Giono, and Céline -- are reasonably important novelists; Simenon, on the other hand, is the French language equivalent of Agatha Christie, and it is a bit surprising to find him placed with the more literary figures just mentioned.
The analysis carried out on the number of novel texts included in the TLF database shows that the texts included tend to be about the same as what might have been included if a different team of scholars of French literature based in British universities -- like the one which finalized the Oxford Companion -- had drawn it up in the late 1950s. If the TLF Committee had a bias in its outlook, that bias was no greater than the one shared by contemporary scholars of French literature in England.
Similarly, the works included do correspond -- particularly for the period up to 1908 -- to what scholars of our day find sufficiently interesting to be included in their published studies. There are exceptions to this trend. Women authors, as everyone knows, are now considered more important and more worthy of study than in 1950. By the same token, certain other authors, Anatole France, Roger Martin du Gard, and Alain-Fournier have become of less interest. It is noticeable that the great questioning of the literary canon, and the ringing condemnation of our tendency to study only the output of dead white males in university literature courses has not resulted, or has not yet resulted, in a significant change in whom graduate students and professors find sufficiently literarily interesting to justify a publication.
Casting the net wider, to include fifty-four more novelists who did not qualify for the TLF database but are included in the Oxford Companion, did not alter these overall results. Clearly the group working out of Nancy had included what was important in French literature, and differences affected only the most minor of literary figures. This suggests that once one has made sure that the greats -- authors of the stature of Balzac, Zola or Proust -- are included in a database one wishes to verify, there is not much point in seeking out further names.
The originally proposed methodology of counting publications mentioning the author or works, using one of the widely available online bibliographies, has been demonstrated as sufficient for validating the choices in a representative full-text database. As more and more such databases become commercially available, the method presented here would seem to have a significance which goes beyond the modern French novel.
A preliminary version of this paper was read at the meeting of COCH/COSH held in Toronto in June 2002.
The research reported here is supported by the Social Sciences and Humanities Research Council of Canada (SSHRCC) under grant numbers 410-98-1348 and 410-03-0500, and the Alzheimer's Society under grant number 04-36.