In designing an epistolary corpus based on material in a letter collection, certain challenges are presented by the text-type -- personal letters -- and by users of machine-readable corpora -- linguists and literary scholars. Stated briefly, this project presents the opportunity to design a corpus suitable for both literary and linguistic research, an attribute not shared by many of the major corpora now in use, with the exception of Old English and dictionary corpora such as the Early Modern Dictionaries Corpus. Linguists and literary scholars, generally speaking, come at language from opposite directions: linguists from the word and phrase level up to text or discourse level, literary scholars from the text level down to the word and phrase level. They meet each other somewhere in the fluid regions of discourse analysis, literary stylistics, and genre typology and taxonomy. Rarely do they swim in the same textual waters. Linguists work with language samples; literary scholars work with authored texts. If a corpus is by definition a collection of texts/samples designed for a particular "representative" function (Oostdijk 1991: 19), then it is not surprising that corpora infrequently meet the requirements of researchers in both disciplines. Yet a specialized corpus designed for a variety of uses is congruent with the research aims of some scholars in both disciplines. First I shall describe my letter collection briefly, and then I shall look at the proposed corpus from a linguistic perspective and finally from a literary point of view.
The Letter Collection
The collection at present consists of approximately 2,000 letters written by or to nineteenth-century women writers during a period of roughly fifty years, 1820-1870. Almost all the letters are unpublished and have been transcribed from manuscripts in archives across Great Britain, the United States and Canada. Around twenty-five women writers are represented, mainly British, although a few are American; "writers" are defined as published authors of more than one work. Well-known figures such as Harriet Martineau, Anna Jameson, Mary Russell Mitford, Fanny Kemble, Florence Nightingale, and Harriet Beecher Stowe are represented, as are lesser-known women such as Anna Maria Hall, Sarah Austin, Mary Howitt, Catharine Sedgwick, and Harriet Grote. The collection was amassed for my doctoral dissertation, a taxonomic study that defines and describes the epistolary form as it is found in nineteenth-century women's letters. My interest in these letters arose from a previous study of the letters of Elizabeth Barrett Browning (EBB) where I discovered that her female correspondents, apart from relatives and friends of both Brownings, were virtually all writers, and that these women seemed to correspond amongst themselves. Questions such as "Was there a network of nineteenth-century women writers?" and, if so, "What kind of relationships did they have?" prompted my search for the letters of EBB's correspondents, and then, in snowball fashion, of their correspondents. Thus the collection even in its present limited form not only represents as writers or recipients virtually all the women "intellectuals" of the period, but also shows the many links amongst them. All the texts from my transcriptions are presently in electronic form. Where copyright regulations permit, the collection will be augmented by published letters, especially in nineteenth-century editions and, as archival work continues, by more unpublished letters. 
But even as it stands, the material meets the basic criteria for a corpus in that it is clearly demarcated by text-type, social group, and historical period boundaries, as well as having biographical, bibliographical and textual parameters.
The Linguistics Perspective: Corpus Linguistics
Because corpus linguistics, as Nelleke Oostdijk phrases it, "aims at the study of actual language use" (Oostdijk 1991: 19), the first corpora -- such as the Survey of English Usage, begun in 1959 and covering varieties of contemporary English in written and spoken forms, the London-Lund Corpus of spoken discourse, and the Brown Corpus of American English c. 1961 -- concentrated on representing the diversity of uses and forms of twentieth-century English. Jan Aarts, in his paper "Corpus Linguistics: an Appraisal", remarked that "corpus linguistics also deals, by necessity, with all aspects of language variation, individual, social and regional", and challenged corpus linguists by suggesting that "one of the aims Corpus Linguistics should set itself is to describe the structure of texts, not merely sentences" (Aarts 1988). Oostdijk posited that corpus linguists had to "come to terms not only with language structure but also with all its relevant extra-linguistic correlates" (Oostdijk 1991: 21), the concern of stylistics and sociolinguistics. The first step, Oostdijk maintained, was to identify the sets of text-internal, linguistic features that determine text-type as well as the text-external features that delimit genres. Study of these correlates is essential as a basis for the study of language variation, but "corpora which have been compiled with the intention of representing a cross-section of the language are not suited for the study of linguistic variation" (Oostdijk 1991: 39). To meet these challenges, then, some corpus linguists have moved in the direction of designing corpora with variously defined synchronic and diachronic parameters. I shall touch on four such projects.
The Helsinki Corpus of Historical and Dialectal English
The diachronic part of this corpus contains more than a million and a half words of British English from 850-1720, many of which are in epistolary texts. Its compilers, Merja Kyto and Matti Rissanen, have said that "the primary purpose of our diachronic corpus is to serve as a database for the variational diachronic study of English morphology, syntax and vocabulary", contending that "change in language can best be approached and described through synchronic variation" (Kyto & Rissanen 1988: 169). Because some understanding of the structures of the spoken language is essential to the study of language change, texts that stand in different relations to spoken language must be selected. This relation is "determined, among other things, by the communication situation, genre and register, and the socio-educational status of the persons involved in the production of the text" (Kyto & Rissanen 1988: 169-70) -- all extra-linguistic features. Each text in the corpus, therefore, is described by a "fairly detailed set of parameter codings" (Kyto & Rissanen 1988: 170) that contain this information. As the focus of study in this corpus is at the word and sentence level, genre questions have been restricted to labelling of text-types and registers (the type of communicative event), and the style parameter was eventually dropped.
A Representative Corpus of Historical English Registers (ARCHER)
The compilers, Douglas Biber and Edward Finegan, described their corpus in 1993 as "part of a project designed to investigate the diachronic relations among oral and literate registers of English between 1650 and the present" (Biber, Finegan & Atkinson 1993: 1). Facilitating both diachronic and synchronic investigations, the corpus also enables investigation of what they describe as "microscopic issues, including a) individual 'style' [...] and b) variation within works" (Biber, Finegan & Atkinson 1993: 2). In an earlier paper, they had contended that "very few linguistic studies to date [had] analyzed the diachronic evolution of genres" and had described why: historical linguistics was focused on phonological and syntactic levels; sociolinguistics and discourse analysis gave little consideration to diachronic issues; and stylistics, where "considerable attention [had been] given to comparative analysis of different 'period' styles", was undertaken from rhetorical or literary perspectives and did not "reflect much linguistic sophistication"; therefore, they employ "analytic techniques developed for sociolinguistic analyses of register variation to study the historical development of genres over the last three centuries" (all Biber & Finegan 1988a: 22). Their method is described as a multidimensional/multifunctional approach (Biber & Finegan 1988b). Five dimensions that represent functional parameters of variation associated with differences in communicative situations are identified and are defined by particular sets of linguistic features. The first dimension, "informational/involved" or, as it might be expressed, a greater or lesser distance in the relation between writer and reader, is shown in Figure 1. The database is made up of texts representing eleven written and spoken registers (including epistolary texts) divided into ten 50-year periods from 1650-1990.
Although their letter register is very small and contains only published letters written by men, at least for the nineteenth century, their findings reported so far have considerable interest for my own work. Not unexpectedly, letters on the whole show a movement from being less involved relationally with the recipient to being more so over the three-hundred-year period (Figure 2); more surprisingly, they show in the mid-nineteenth century a drop back to relational levels of the seventeenth century before rising by the end of the century to early eighteenth-century levels and then continuing to rise in the twentieth century (Figure 3). Why? Gender relations are also interesting: for the only time in the three centuries, nineteenth-century letters show no difference in the relations between women and men except when women write to men. Men writing to men, women writing to women, and men writing to women all show the same degree of relational involvement (Figure 4). Would comparable analyses on a larger number of nineteenth-century epistolary texts yield the same result? Would more detailed linguistic analysis support those results and/or be able to describe in what way the shift in the first dimension took place?
The Cambridge-Leeds Corpus of Early Modern English (1600-1800)
The inspiration for this corpus comes both from the Helsinki Corpus and ARCHER. According to Susan Wright (Wright 1993), the corpus is
designed to reflect and accommodate the fluidity of genre distinctions as the period progresses by focussing on authors rather than text-types as representative of the state of the language. However, genre remains available as an isolating feature of a group of texts without determining the selection of texts. To ensure compatibility with other historical corpora, [the] texts are given a generic characterization [...] and functional or situational criteria, common in sociolinguistic approaches. (Wright 1993: 26)
In her illustration of why "the distinction between literary and non-literary texts becomes particularly problematic when applied to [the] period" (Wright 1993: 26), Wright discusses the blurred boundaries between private and public in the personal letter, a distinction I also query in my own work on the form. She questions the usual designation in corpus linguistics of the letter as an "involved" or informal form because of its literary status, saying "there is a significant situational difference between letters written for publication and letters written privately" (Wright 1993: 27). But my own work on the form has led me to believe that "written for publication" is not a very useful distinction. Other than letters written expressly for publication such as letters to the newspaper, when are letters written in any other way but with an assumption of "privacy" -- even by "famous" authors who might well be aware, especially in the nineteenth century with its interest in biography, that such letters may eventually be published? To be more precise, a letter is usually written to a designated recipient in the full awareness that it may become public, that is, read or heard by others, unless the recipient destroys the letter after it has been read. All letters, because they are written documents, depend on the discretion of the recipient for their privacy. Continuing work on situational/functional criteria and the correlating linguistic features derived from studies of contemporaneous idiolects may well change generic assumptions, particularly for the letter with its ambivalent relations with written or oral, literary or non-literary, forms of discourse.
The final studies I wish to mention very briefly are those taking place on "Electronic Language", known as EL, used in e-mail communications, a contemporary variant of the epistolary form. One such study by Collot and Belmore used the Survey of English Language Corpus and a privately-collected corpus of e-mail communications, to hypothesize that, as "EL has unique situational features", it would embody "a distinctive set of linguistic features as well" (Collot & Belmore 1993: 42). They report, so far, clearly different uses of comparative adjectives. They also refer to a subsequent study on EL by Herring that "suggests a new textual dimension, a functional continuum ranging from 'adversarial' to 'attenuated' in which gender and type of discussion play a major role in the distribution of linguistic features" (Collot & Belmore 1993: 53).
Thus as smaller, specialized corpora are developed to study, under a general umbrella of language variation studies, specific genres, authors, and situational contexts more closely, the research findings will have considerable relevance to literary scholars working on either individual authors or genre and discourse studies. The challenge in designing an epistolary corpus appealing to linguists will be to identify texts with coded headings covering extra-linguistic features, to supply supporting detailed biographical information and word counts, and to facilitate the creation of subsets including a parsed one.
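What coded headings, word counts and subsets might look like in practice can be suggested by a minimal sketch. It is offered only as an illustration under stated assumptions, not as the design of the proposed corpus: the field names, the codes "F" and "M", the `subset` helper and the three sample letters (with placeholder correspondents A, B and C) are all invented for the example, and the word count relies on crude whitespace splitting.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Letter:
    """One letter with hypothetical header codes for extra-linguistic features."""
    writer: str
    recipient: str
    written: date
    writer_sex: str      # illustrative code: "F" or "M"
    recipient_sex: str   # illustrative code: "F" or "M"
    source: str          # e.g. "manuscript" or "published edition"
    text: str

    @property
    def word_count(self) -> int:
        # Whitespace tokenisation only; a real corpus would document its scheme.
        return len(self.text.split())


def subset(letters, **codes):
    """Return the letters whose header fields equal every code supplied."""
    return [
        letter for letter in letters
        if all(getattr(letter, field) == value for field, value in codes.items())
    ]


# Invented sample records; A, B and C stand in for correspondents.
letters = [
    Letter("A", "B", date(1845, 3, 1), "F", "F", "manuscript",
           "My dear friend, I write in haste."),
    Letter("B", "A", date(1845, 3, 9), "F", "F", "manuscript",
           "Your letter reached me yesterday."),
    Letter("A", "C", date(1852, 6, 2), "F", "M", "published edition",
           "Sir, I enclose the proofs."),
]

# A subset defined purely by header codes: women writing to men.
women_to_men = subset(letters, writer_sex="F", recipient_sex="M")

# Word counts for the whole sample and for the manuscript subset.
total_words = sum(letter.word_count for letter in letters)
manuscript_words = sum(l.word_count for l in subset(letters, source="manuscript"))
```

Because the header codes are ordinary fields, any combination of them defines a subset, so a linguist could isolate women writing to men while a literary scholar isolates manuscript-based texts from the same records.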
The Literary Perspective
The immediate benefit from an epistolary corpus of nineteenth-century letters would be to make available an electronic archive of literary women's letters which, in all probability, will never be published in scholarly editions. Many of the letters' authors are relatively obscure, of interest perhaps only to feminist researchers. Yet literary scholars also need to know much more about professional women writers in the period. The rise of the Victorian "man of letters" has been documented, but the equivalent study of the "woman of letters" -- the women who made a career of writing in forms other than the novel -- has as yet received only sporadic or isolated attention. With careful tagging for indexing purposes, such a corpus would aid research of this kind. Epistolary texts, however, offer greater challenges than their general inaccessibility due to the obscurity of their writers' lives. There are simply too many letters; they are too tangled; and they are too inscrutable.
How many is too many? Unless wholesale destruction has taken place, a single Victorian letter-writer's output may over a lifetime run to thousands of texts. For instance, in the British Library alone, there are over 13,000 of Florence Nightingale's letters. And letter texts are sprinkled over the world where they often rest in anonymity in autograph albums, private collections, and repositories large and small. Collecting even a single writer's letters for publication is a daunting task, and no edition can ever with much assurance be considered complete. Neither volume nor binding is, however, a determinant of an electronic edition. An epistolary corpus can be flexible, and is not necessarily teleologically driven.
How can epistolary texts be tangled? Letters are tiny texts complete in themselves which can be studied in isolation. Yet individual letters are also a part of a larger whole: the correspondence between the writer and the recipient. Interpretation, therefore, depends on the availability of the "other half". Letter texts must be studied in pairs, pairs that overlap, as long as the writers remain separated, each letter being a response to a former communication as well as the initiator of the next communication. Conventional publication cannot very easily accommodate this Janus-like feature of the epistolary text. The ambitious edition of the Brownings' letters, which has recognized this important characteristic by including both sides of the correspondence, is slated to run to forty volumes -- a more than life-time project as only about a dozen volumes have appeared in as many years. And although it is carefully indexed, the massive size of the edition will make searching for information or even tracing a single correspondence over a number of years very cumbersome. A corpus arranged by the correspondences between interactants can, however, capture this feature and through subsetting, allow the user to retrieve a single correspondence, or profile a single author, or display a network of correspondences within a particular time period. The tangle becomes instead a web whose strands may be examined synchronically or diachronically.
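The three kinds of retrieval just described, a single correspondence, a single author's profile, and a network of correspondences within a period, can be sketched in a few lines. The sketch is hypothetical throughout: the function names, the reduction of each letter to a (writer, recipient, date) triple, and the placeholder correspondents A, B and C are assumptions made for illustration and describe no existing software or real letters.

```python
from collections import Counter
from datetime import date

# (writer, recipient, date) triples; invented placeholders, not real letters.
letters = [
    ("A", "B", date(1845, 3, 1)),
    ("B", "A", date(1845, 3, 9)),
    ("A", "C", date(1852, 6, 2)),
    ("A", "B", date(1846, 1, 15)),
    ("C", "B", date(1861, 11, 20)),
]


def correspondence(letters, x, y):
    """Both sides of the exchange between x and y, in date order."""
    pair = {x, y}
    return sorted(
        (l for l in letters if {l[0], l[1]} == pair),
        key=lambda l: l[2],
    )


def profile(letters, author):
    """Every letter written by one author, in date order."""
    return sorted((l for l in letters if l[0] == author), key=lambda l: l[2])


def network(letters, start_year, end_year):
    """Count letters per unordered pair of correspondents within a period."""
    links = Counter()
    for writer, recipient, written in letters:
        if start_year <= written.year <= end_year:
            links[frozenset((writer, recipient))] += 1
    return links


a_and_b = correspondence(letters, "A", "B")   # letter, reply, later letter
by_a = profile(letters, "A")                  # one author's output
forties = network(letters, 1840, 1849)        # links active in the 1840s
```

Sorting a correspondence by date interleaves each letter with its reply, which is the paired reading that a printed edition arranged by author cannot easily offer; counting links per period gives a synchronic slice of the web, and comparing slices gives the diachronic view.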
How can a letter, a written document, be inscrutable? Interpretation of epistolary texts, as I indicated above, depends on entire correspondences: the writer's self is constructed and revealed in the felt presence of the absent addressee. It depends as well on entire texts. In many nineteenth-century editions, letters are published in truncated or censored form, and even twentieth-century editions often omit or re-locate spatial and temporal identifiers and subscriptions, all formal constituents necessary for interpretation. But above all, interpretation depends on an understanding of the form itself, an understanding not as yet based on taxonomic or typological criticism. Those few literary scholars who study letters as texts tend to make as many a priori judgments as do linguists. As John Burrows points out, "[i]f the computer is to contribute as much as it might in literary studies, the wider-ranging studies of the future will need to include much more work than has yet been done on genre differences and on historical change in the language of literature" (Burrows 1992: 188). An electronic corpus of complete epistolary texts based where possible on manuscript transcriptions, with textual features tagged for comparative purposes, will aid literary scholars in their study of the form.
Linguists and literary scholars, it would appear, have much to offer each other in the study of a long-neglected genre. A corpus of epistolary texts such as I envision, carefully designed to accommodate both the requirements of researchers with different aims and the unique characteristics of the text-type, will provide an opportunity for unusual side-by-side, perhaps even collaborative, approaches to the nineteenth-century letter.
 See Lancashire & Wooldridge 1994.
 See Johansson 1991 for a brief description of these and other corpora.
 See Wright 1994 for a discussion of some problems arising from established typological and functional criteria in corpus-based linguistic analysis.
- AARTS, J. (1988). "Corpus Linguistics: An Appraisal", Paper read at the Fifteenth International Conference on Literary and Linguistic Computing, Jerusalem, June.
- AARTS, J. & W. MEIJS (1990). Theory and Practice in Corpus Linguistics, Amsterdam: Rodopi.
- AARTS, J., Pieter de HAAN & Nelleke OOSTDIJK, eds. (1993). English Language Corpora: Design, Analysis and Exploitation, Amsterdam: Rodopi.
- BIBER, Douglas & Edward FINEGAN (1988a). "Historical Drift in Three English Genres", Georgetown University Round Table on Language and Linguistics (ed. T. Walsh), Washington: Georgetown UP: 22-36.
- BIBER, Douglas & Edward FINEGAN (1988b). "Drift in Three English Genres from the 18th to the 20th Centuries: A Multidimensional Approach", Corpus Linguistics: Hard and Soft (ed. Kyto et al.), Amsterdam: Rodopi: 83-101.
- BIBER, Douglas, Edward FINEGAN & Dwight ATKINSON (1993). "ARCHER and its Challenges: Compiling and Exploring a Representative Corpus of Historical English Registers", English Language Corpora: Design, Analysis and Exploitation (ed. Aarts et al.), Amsterdam: Rodopi: 1-13.
- BURROWS, John F. (1992). "Computers and the Study of Literature", Computers and Written Texts (ed. C.S. Butler), Oxford: Blackwell: 167-88.
- COLLOT, Milena & Nancy BELMORE (1993). "Electronic Language: A New Variety of English", English Language Corpora: Design, Analysis and Exploitation (ed. Aarts et al.), Amsterdam: Rodopi: 41-55.
- JOHANSSON, Stig (1991). "Times Change, and So Do Corpora", English Corpus Linguistics: Studies in Honour of Jan Svartvik (ed. K. Aijmer & B. Altenberg), London: Longman: 315-35.
- KYTO, Merja & Matti RISSANEN (1988). "The Helsinki Corpus of English Texts: Classifying and Coding the Diachronic Part", Corpus Linguistics: Hard and Soft (ed. Kyto et al.), Amsterdam: Rodopi: 169-79.
- KYTO, Merja, Ossi IHALAINEN & Matti RISSANEN, eds. (1988). Corpus Linguistics: Hard and Soft, Amsterdam: Rodopi.
- LANCASHIRE, Ian & T.R. WOOLDRIDGE (1994). Early Dictionary Databases. CCH Working Papers, 4.
- OOSTDIJK, Nelleke (1991). Corpus Linguistics and the Automatic Analysis of English, Amsterdam: Rodopi.
- WRIGHT, Susan (1993). "In search of History: English Language in the Eighteenth Century", English Language Corpora: Design, Analysis and Exploitation (ed. Aarts et al.), Amsterdam: Rodopi: 25-39.
- WRIGHT, Susan (1994). "The Place of Genre in the Corpus", Corpora Across the Centuries: Proceedings of the First International Colloquium on English Diachronic Corpora (ed. M. Kyto, M. Rissanen & S. Wright), Amsterdam: Rodopi: 101-7.