1. Introduction (B. Merrilees)

The Dictionarius of Firmin Le Ver (DLV) is a very large Latin-French dictionary compiled at the Carthusian house of St. Honoré at Thuison, near Abbeville in Northern France, in the first half of the fifteenth century. The text is preserved in Paris, Bibl. Nat., nouv. acq. fr. 1120, where it takes up 467 of the manuscript's 478 folios. It contains a total of 12,800 headwords, plus 37,700 sub-headwords, in all a text of 540,000 words, giving an electronic file of 4.5 MB.

In 1986 William Edwards and I set out to produce a critical edition of the DLV, a project now virtually complete. Although we have always intended to use the text base of the dictionary for a variety of purposes, the production of the edition has remained paramount; we might have proceeded differently if we had had to consider the DLV only as a textbase for analysis and exploitation. Certainly we might have marked the text more than we did, but at the beginning we were unaware of many features of the text that would later capture our attention.

The text was entered in WordPerfect (initially 4.0 and 4.1, later 4.2, now converted to 5.1) and set out in a manner that aimed at representing, as best we could, the layout of the dictionary entries on the manuscript page. The entries in the DLV are better termed macro-entries: most are made up of a headword, marked in the manuscript by a coloured initial capital, followed by one or more sub-headwords, set at the left margin and beginning with a capital in the ordinary brown ink of the text. The definitions, which are in Latin and French, are also in ordinary ink, though for the first few folios a hand, not Le Ver's, has underlined the French. This practice is found elsewhere in bilingual dictionaries. We entered the text respecting the line length of the manuscript column and used the WordPerfect codes on a colour screen to emulate some features of the layout. Bold capitals were used for the headwords, bold with a single initial capital for the sub-heads, and underlining (now italics) to set off the French. The Latin of the rest of the definitional and attributional material, by far the bulk of the text, was in regular type. The printed text looks like this:[1]

    ABISSUS  ab *a, quod est sine, et *bissus componitur
    Abissus .ssi  abisme profunditas aquarum                f
        impenetrabilis vel spelunca aquarum latitantium
        unde fontes et flumina procedunt, scilicet pelagus
    Abissus  eciam dicitur profunditas scripturarum

    ABITIO .tionis  .i. recessio  In ¶Abeo, abis dicitur

    ABIUGO .gas  ex *ab et *iugo .gas componitur            act
    Abiugo .gas  media correpta  .i. a iugo separare,
        dissociare, abgregare desjoindre, separer
    Abiugatus .a .um  desjoins, separés, divisés            o

    ABIUNGO  ex *ab et *iungo componitur                    act
    Abiungo .gis .xi .ctum  desjoindre longe
        iungere, separare, segregare, dividere
    Abiunctus .a .um  desjoint separatus, semotus
    Abiunctim  adverbium  separeement, desjointement

    ABIURO .ras  ex *ab et *iuro .ras componitur
    Abiuro .ras .ratum  .i. periurio negare                 act
        .i. deneer, nier par mentir, par parjuremens
    Abiuratus .ta .tum  niés par mentir
    Abiuratio .tionis  .i. rei credite abne-                f
        gatio, periuratio, inficiatio deniemens
    Abiurtio .tionis  idem, per sincopam                    f

    ABLACTO .ctas componitur de *ab et *lacto .ctas         act
    Ablacto .ctas .ctatum  ensevrer, sicome
        enfant on oste de la mamelle .i. a la-
        cte removere, extrahere et separare
    Ablactatus .a .um  .i. ensevrés a lacte ex-             o
        tractus, semotus, separatus a mamilla
    Ablactatio .tionis  ensevremens                         f

    ABLATUS .ta .tum  osteis remotus,                       o
        separatus, semotus  ab *aufero .fers dicitur
    Ablativus .a .um  qui aufert qui oste, ostans           o
    Ablativus .tivi  quidam casus ablatis                   m
    Ablatio .tionis  ostemens semotio                       f
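
The layout conventions just described (capitalised headword, sub-headwords at the left margin, indented continuation lines, abbreviations in the right margin) lend themselves to a simple structural reading. The following is a minimal sketch in Python, not the project's own tooling: the names `parse_layout`, `MacroEntry` and `SubEntry` are hypothetical, and the rule that a margin marker is a single final word preceded by at least three spaces is an assumption drawn only from the sample above.

```python
import re
from dataclasses import dataclass, field

# Assumption: a right-margin marker is one final word set off by 3+ spaces.
MARGIN = re.compile(r"^(?P<body>.*\S)\s{3,}(?P<margin>\w+)$")

@dataclass
class SubEntry:
    lemma: str
    text: str
    margin: list = field(default_factory=list)

@dataclass
class MacroEntry:
    headword: str
    subs: list = field(default_factory=list)  # first item is the headword line

def parse_layout(lines):
    """Group laid-out lines into macro-entries (hypothetical helper)."""
    entries, base_indent = [], None
    for raw in lines:
        if not raw.strip():
            continue  # blank lines only separate macro-entries visually
        indent = len(raw) - len(raw.lstrip())
        content, margin = raw.strip(), None
        m = MARGIN.match(content)
        if m:
            content, margin = m.group("body"), m.group("margin")
        if base_indent is None:
            base_indent = indent
        if indent > base_indent and entries:
            # Continuation line: rejoin words hyphenated at the line break.
            sub = entries[-1].subs[-1]
            if sub.text.endswith("-"):
                sub.text = sub.text[:-1] + content
            else:
                sub.text += " " + content
        else:
            lemma = content.split()[0]
            sub = SubEntry(lemma, content)
            if lemma.isupper() or not entries:
                entries.append(MacroEntry(lemma, [sub]))  # new headword
            else:
                entries[-1].subs.append(sub)              # sub-headword
        if margin:
            sub.margin.append(margin)
    return entries

SAMPLE = [
    "ABIURO .ras  ex *ab et *iuro .ras componitur",
    "Abiuro .ras .ratum  .i. periurio negare                 act",
    "    .i. deneer, nier par mentir, par parjuremens",
    "Abiuratus .ta .tum  niés par mentir",
    "Abiuratio .tionis  .i. rei credite abne-                f",
    "    gatio, periuratio, inficiatio deniemens",
]
entries = parse_layout(SAMPLE)
```

Such a reading recovers only what the layout itself encodes; it does not distinguish Latin from French, which in our files depends on the type style.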

During the period it took to enter the text and beyond we continued to work on various ancillary studies, questions of sources and transmission, the status of French in the dictionary, the nature and function of metalanguage (Merrilees, 1988, 1990, 1991; Merrilees & Edwards, 1989). This last has led to further work on the structure of the dictionary entry, an extension of the visual aspect noted above (Merrilees, 1992). The main components of a dictionary entry are the lemma (headword or sub-headword) and the definition, but around these two poles there can be various markers and much additional information about the lemma, its attributes and function. In the DLV this additional information is distributed in three locations and we have found that each position appears to privilege certain kinds of information.

The three positions are post-lemmatic, post-definitional, and marginal (the right margin); their functions are:

  1. The post-lemmatic position in the DLV is mostly reserved for definitional connectors, parts of speech other than noun or verb (e.g. adverbium, prepositio), phonetic information expressed absolutely (e.g. media correpta, penultima producta), 'accidents' other than gender or voice (e.g. pluraliter, diminutivum), etymologies, compounds and derivations (e.g. ABISSUS ab *a, quod est sine, et *bissus componitur), indications of the language, usually Greek (e.g. ACCIDIA accin grece, latine dicitur cura).
  2. The post-definitional position is used for derivation, especially inde and unde but also longer expressions (e.g. ABLATUS [...] ab *aufero .fers dicitur), 'accidents' (e.g. et comparatur), phonetic and orthographical information in full expressions (corripitur, producitur), external and internal references (e.g. ABITIO .tionis [...] In ¶Abeo, abis dicitur, Pastoralia [...] Amos primo capitulo dicitur), remarks concerning usage (e.g. Calvo [...] sed non est in usu), including the so-called versus memoriales. Our present definition of post-definitional includes information appearing after any part of a definition, which sometimes means after the definition in Latin but before the definition in French. There are several possible patterns.
  3. The marginal position privileges 'accidents' of gender and voice which appear in the right margin in abbreviated form (m, f, n, act, etc.), but also allows reference to authorities, a feature deriving from manuscripts of Papias' Elementarium, or an indication that exemplary verses are present, marked by V or Versus. The last two can even be outside the drawn margin of the column, technically extra-marginal.

Concording programs, such as WordCruncher, can easily pick up the metalinguistic vocabulary under analysis, but they are not well suited for dealing with component elements of a dictionary article as these stand in relation to one another. Nonetheless we have had useful output from WordCruncher, which William Edwards describes here, and from an indexing and concording program that David Megginson reports on later in this paper.

2. WordCruncher: strengths and weaknesses (W. Edwards)

WordCruncher promotes itself accurately and concisely as text indexing and retrieval software, that is, as a two-stage process involving the retrieval of data from pre-indexed DOS text files. In fact, the program operates as two interrelated but distinct components, namely indexing and viewing. The program, in short, massages any DOS text file, creating in the process a word-frequency list which the user can then manipulate to gain random access to the text file and to retrieve data. The program's strength is the ease with which designated character-strings, suffixes and prefixes, or various combinations of words or letters can be searched; in our preliminary work of transcribing, entering and checking the dictionary text content, that searching capacity was invaluable. However, as we moved to an analysis of the structure of our text, as prepared, WordCruncher had its limitations, though we should point out that such limitations relate as much to our application as to the program itself. For example, WordCruncher can list all occurrences of a particular word, but cannot identify the most frequent word in the post-lemmatic position; it can list all French occurrences of the suffix -iet but cannot identify the frequency of French in the definitional position; it can provide all examples where words 1 and 2 are followed within a designated number of spaces by word 3 and/or 4, but the program cannot identify the schematic structure of an entry. In the initial stages this was less a concern to us than the capacity to have rapid access to the textbase.
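
To make the contrast concrete, here is a hedged sketch, in Python, of the kind of positional query a plain word index cannot answer: which metalinguistic term is most frequent in a given position. It assumes that each entry has already been segmented into named fields, which our files do not do; the field names, the small `METALANGUAGE` set and the hand-segmented records (adapted from the sample entries in the Introduction) are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical working vocabulary of metalinguistic terms.
METALANGUAGE = {"componitur", "dicitur", "adverbium", "inde", "unde"}

# Hand-segmented records; the segmentation is an assumption for illustration.
ENTRIES = [
    {"lemma": "ABISSUS",
     "post_lemmatic": "ab a quod est sine et bissus componitur"},
    {"lemma": "ABIUGO",
     "post_lemmatic": "ex ab et iugo componitur", "margin": "act"},
    {"lemma": "ABIUNGO",
     "post_lemmatic": "ex ab et iungo componitur", "margin": "act"},
    {"lemma": "Abiunctim",
     "post_lemmatic": "adverbium",
     "definition": "separeement desjointement"},
    {"lemma": "ABLATUS",
     "definition": "osteis remotus separatus semotus",
     "post_definitional": "ab aufero dicitur", "margin": "o"},
    {"lemma": "ABITIO",
     "definition": "recessio",
     "post_definitional": "in abeo abis dicitur"},
]

def position_counts(entries, position, vocabulary=None):
    """Count the words found in one structural position of each entry."""
    counts = Counter()
    for entry in entries:
        words = entry.get(position, "").lower().split()
        if vocabulary is not None:
            words = [w for w in words if w in vocabulary]
        counts.update(words)
    return counts

post_lemmatic = position_counts(ENTRIES, "post_lemmatic", METALANGUAGE)
post_definitional = position_counts(ENTRIES, "post_definitional", METALANGUAGE)
```

On these six records `post_lemmatic.most_common(1)` yields componitur and `post_definitional` contains only dicitur, which is the link between information and location in miniature.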

WordCruncher will triangulate a given reference, provided the user has prepared (pre-indexed) the text according to a three-tiered system. The generation of publishable indices in a variety of formats -- book index, key-word-in-context, key-word-in-line -- is apparently the controlling principle of pre-indexing. We found that, for our purposes, a basic, 'untreated' text file in DOS format, with a unique filename, will provide access and data retrieval as satisfactory as one that has been given a more sophisticated pre-indexing treatment; additional preparatory marking, or adapting the text file to the hierarchical structure suggested by the program, yields limited further returns. Even if we had marked our text more than was done, the structure of the dictionary article could only have been partially captured by the three-level hierarchy.

With whatever level of sophistication the text is prepared, the program produces a word-frequency list, which is then used to mark the text, and through which both the program and the user access the file. The levels of indexing, or lack of them, do not in any way affect this random accessibility, nor the power of recall: lists of specified citations, modified or not, are available to the screen, as a DOS file, or can be sent to the printer.

As Brian Merrilees has shown, our medieval compiler greatly anticipated our task, simplifying the need for the detailed marking of our text, using principles of lay-out as pre-indexing tools which we have chosen to reproduce. Alphabetically arranged dictionaries, after all, are already largely pre-indexed.

The principal benefit of concording Le Ver's text is to provide random access to the French imbedded within the Latin text. It was hoped initially that WordCruncher's three levels of referencing could be used to provide a ready reference for each French word as follows: the designated French citation, in its extended, French context; its 'book' (dictionary and letter); together with its Latin referents: Latin headword and Latin sub-headword. However, extensive marking and preparation of the text to achieve this end yielded minimal advantages over the use of a largely unmarked text. For the Le Ver text the practical limitations of WordCruncher prevented a useful application of the three available levels of reference. As we progressed in our analysis, it became clear that a dictionary entry structure as we have described it above would have required a different kind of software. However, it was through WordCruncher's search results that we came to a fuller understanding of Le Ver's methods of compilation. For example, it was our searches of metalinguistic terms that confirmed the importance of the link between information and its location.

The preparatory efforts that proved the most useful, curiously, concerned not the preparation of the text file but the manipulation of the Character Sequence file. This file provides five built-in options for organising the generated word list: four default language sequences -- English, French, German, Spanish -- and a fifth, user-specific sequence that can be tailored and modified to reflect editorial practice. Here the dictionary of the text file's vocabulary, its list of unique words, can be adapted to any hierarchical sequence or equivalence, as the user decides. Every ASCII character can be assigned, by the user, to one of seven types: Upper case, Lower case, Delimiter, DelimLower (marks the end of a word and is a separate word in itself, and can be searched as such), Hyphen, Apostrophe, or Ignore; the text will be indexed and the word-frequency file sorted accordingly.
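
The effect of a user-defined character sequence can be sketched as follows. This is an illustration in Python of the general idea (a sort order with equivalences and ignored characters), not a reproduction of WordCruncher's file format; the grouping of i with j, u with v and e with é is a hypothetical editorial choice, and `make_sort_key` is a name invented for the example.

```python
def make_sort_key(groups, ignore=""):
    """Build a sort key from a character sequence.

    groups: strings of characters; characters in the same string sort as equal.
    ignore: characters dropped before comparison.
    """
    rank = {}
    for position, group in enumerate(groups):
        for ch in group:
            rank[ch] = position
    unknown = len(groups)  # characters outside the sequence sort last

    def key(word):
        return [rank.get(ch, unknown) for ch in word.lower() if ch not in ignore]

    return key

# Hypothetical sequence: e/é, i/j and u/v are treated as equivalent.
GROUPS = "a b c d eé f g h ij k l m n o p q r s t uv w x y z".split()
key = make_sort_key(GROUPS, ignore="-")

words = ["niez", "niés", "nier"]
plain_order = sorted(words)             # code-point order puts "niés" last
custom_order = sorted(words, key=key)   # "niés" falls between "nier" and "niez"
```

Under the default code-point ordering the accented form is pushed to the end of the list; under the tailored sequence it takes its expected alphabetical place.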

The most immediate and useful product of a crunched text is this word-frequency file -- a list of all words found in the text, with the frequency of usage of each, sorted according to the designated character sequence. In addition to being an integral point of access by the program to the text, under the View option, this file can be manipulated as a generic word-processor file in its own right: it can be sorted by frequency, suffix, etc., and it is particularly useful as a proof-reading device, when special attention can be paid to single-frequency occurrences.
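
A word-frequency list of this kind, and the proof-reading use of single-frequency forms, can be approximated in a few lines. The sketch below is in Python and makes its own assumption about what counts as a word (any unbroken run of letters); WordCruncher's own tokenisation is governed by its Character Sequence file and may differ.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count every run of letters, ignoring case (assumed word definition)."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))

def single_occurrences(freq):
    """Forms that occur once only: the first place to look for typing errors."""
    return sorted(word for word, n in freq.items() if n == 1)

def by_suffix(words):
    """Sort words by their endings (reverse spelling)."""
    return sorted(words, key=lambda word: word[::-1])

TEXT = (
    "Abiuratio .tionis .i. rei credite abnegatio, periuratio, inficiatio "
    "deniemens Abiurtio .tionis idem, per sincopam"
)
freq = word_frequencies(TEXT)
once = single_occurrences(freq)
```

Here `once` includes abiurtio alongside abiuratio; an editor would then check the manuscript to confirm that the shorter form is Le Ver's syncopated variant and not a slip in transcription.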

In this project our main purpose, which was to provide a traditional (i.e. printed) edition of a medieval manuscript text, was obviously somewhat at odds with the preparation of a text for electronic manipulation. Nonetheless, WordCruncher proved to be a powerful and useful tool, albeit within prescribed parameters.

3. Old-Fashioned Concording (D. Megginson)

Electronic concording programs like WordCruncher create interactive concordances: the user decides what information to retrieve while using them. Printed concordances like A Microfiche Concordance to Old English, on the other hand, are static concordances: the editors decide how to organise the information, and the user can access it only in that way. Interactive concordances are very useful tools within research projects like the Dictionarius Le Ver, but when we want to share our work with other scholars, they are still unsuitable for several reasons.

The first problem is distribution. An interactive concordance requires access to a computer, and if there is software bundled with it, it requires access to a specific type of computer. Scholars cannot simply pull the concordance off a library or bookstore shelf and browse through it, or bring it with them on research trips.

The second problem is the lack of standards among computers. Nearly all computers can exchange simple digits and Latin text using the ASCII or EBCDIC standards, but there is no universally accepted method for exchanging even something as simple as é or a 4-byte machine word (long integer), much less a complex binary file structure like the one used by WordCruncher. Today, an interactive concordance must be bound not only to a single computer, but to a single software package.

When it comes to publishing, static concordances avoid nearly all of the problems of interactive concordances. When they are printed on paper, they require no special technology to use, and they can follow standards of typesetting and book-binding which are already well established. Printed concordances also take advantage of the existing distribution system of book sellers and libraries to reach the largest possible audience, and are easy to bring into research facilities for field work.

Furthermore, looking up a single, complete word in a printed (paper) concordance can be as fast as looking it up in an interactive concordance on a computer. However, there are several serious disadvantages to printed static concordances.

First of all, static concordances always limit the user's choices in ways that interactive concordances do not. If a concordance is in alphabetical order, the user can find all words beginning with b grouped together, but not all words containing or, for example. Static concordances also allow only one way to access each citation: you can find all of the citations containing et and all of the citations containing on, but not the citations which contain both.
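
The two queries mentioned here, citations containing both of two words and words containing a given letter sequence, are easy for an interactive system because it can combine index entries on demand. A minimal sketch in Python follows; the function names are hypothetical and the three citations are taken from the sample entries in the Introduction.

```python
from collections import defaultdict

def build_index(citations):
    """Map each word to the set of references whose citation contains it."""
    index = defaultdict(set)
    for ref, text in citations.items():
        for token in text.lower().split():
            word = token.strip(".,;:")
            if word:
                index[word].add(ref)
    return index

def containing_all(index, *words):
    """References whose citation contains every one of the given words."""
    sets = [index.get(word, set()) for word in words]
    return set.intersection(*sets) if sets else set()

def words_containing(index, fragment):
    """Indexed words that contain the fragment anywhere, not only initially."""
    return sorted(word for word in index if fragment in word)

CITATIONS = {
    "ablacto": "ensevrer, sicome enfant on oste de la mamelle "
               ".i. a lacte removere, extrahere et separare",
    "abiugo": "ex ab et iugo .gas componitur",
    "ablativus": "qui aufert qui oste, ostans",
}
index = build_index(CITATIONS)
both = containing_all(index, "et", "on")   # only the ablacto citation
```

A printed alphabetical concordance gives the two lists for et and on separately; finding their intersection is left to the reader.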

The second problem stems partly from the first. Concordances are very long, and become even longer when one tries to provide more options for the user. Even a simple, alphabetical concordance can be considerably longer than the original text. For example, if you are concording a 200-page text where the average citation is 40 words long, the concordance will be over 8,000 pages long in the same type size. If you add another type of listing, such as reverse spelling, the concordance will be over 16,000 pages long, and so on. Electronic interactive concordances can generate this information as required -- the average user will never need most of it -- but a static concordance must contain it all explicitly.
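
The arithmetic behind these figures can be set out explicitly. The sketch below, in Python, rests on two simplifying assumptions that are ours and not measured values: every word of the text is concorded once per listing, and a concordance page holds as many words as a page of the text. Under those assumptions each word of the text is replaced by a whole citation, so the text expands by a factor equal to the citation length.

```python
def concordance_pages(text_pages, citation_words, listings=1):
    """Estimated page count of a contextual concordance.

    Assumes every word of the text is a keyword in each listing and that
    page density is the same as in the original text.
    """
    return text_pages * citation_words * listings

alphabetical_only = concordance_pages(200, 40)               # 8000
with_reverse_spelling = concordance_pages(200, 40, listings=2)  # 16000
```

Keyword headings and references add further bulk, which is why the real totals run somewhat over these round figures.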

It will usually not be possible to publish an 8,000- or 16,000-page concordance printed on paper. The best alternative is microfiche, as the Dictionary of Old English project has done with its concordance. However, now the users are tied to a microfiche reader, and have already lost one of the greatest advantages of the printed static concordance -- its portability and freedom from technological constraints -- without gaining any of the advantages of interactive concordances. The only remaining advantage is that microfiche readers are more commonly available in libraries than computers. The rest of this paper will explore the options which we have considered at the Dictionarius Le Ver project to generate concise, useful static concordances for publication.

Usually, concordances show keywords in context, either with a fixed number of words on either side or within an entire quotation. The simplest way to generate a smaller concordance is to omit the context altogether. Here is a sample French concordance of an early draft of the Dictionarius Le Ver M section without context:

punir:

1 multo

punis:

1 multo (multatus)

punition:

1 multo (multatio)

pur:

1 merum (merum)

puree:

1 merula (merula)

purement:

1 merax (meraciter)
2 merosus (merose)
3 merus (mere)

purgier:

1 mucus (muco)

purgiés:

1 mucus (mucatus)

purifiés:

1 merax

purs:

1 merax
2 merosus (merosus)
3 merus

putain:

1 manzer
2 multicuba

puterie:

1 meretrix (meretricatio)

A lexicon like the Dictionarius Le Ver is ideal for this sort of concordance. A non-contextual concordance of a novel, for example, would have to list only page numbers, and would be difficult to use because a page contains so many different words. The Dictionarius Le Ver is organised hierarchically by headword and sub-headword, and each sub-headword passage contains only a small amount of French. Furthermore, unlike such references as "page 38" or "Act 3, scene 5", the headword and sub-headword still give a fair bit of useful information about a word's context. Still, in this concordance, we are considering including the surrounding French for more context. In the case of putain, for example, the entries would look like this:

putain:

1 manzer bastard fil de putain publique de bordel
2 multicuba putain qui couche aveuc chescun ribaude
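
Both forms of listing, with and without the surrounding French, can be produced from the same records. The following Python sketch is illustrative only (our concordances were generated with other tools, described below); the record format of headword, optional sub-headword and French text is an assumption, and the two records are those of the putain example.

```python
from collections import defaultdict

# (headword, sub-headword or None, French text of the passage)
RECORDS = [
    ("manzer", None, "bastard fil de putain publique de bordel"),
    ("multicuba", None, "putain qui couche aveuc chescun ribaude"),
]

def concord(records, with_context=False):
    """Build a keyword -> references table, optionally with French context."""
    table = defaultdict(list)
    for head, sub, french in records:
        ref = head if sub is None else f"{head} ({sub})"
        for word in set(french.split()):  # one reference per passage
            table[word].append(f"{ref} {french}" if with_context else ref)
    return {word: sorted(refs) for word, refs in sorted(table.items())}

def render(table):
    """Lay the table out in the style of the samples in this paper."""
    lines = []
    for word, refs in table.items():
        lines.append(f"{word}:")
        lines.append("")
        for number, ref in enumerate(refs, 1):
            lines.append(f"{number} {ref}")
        lines.append("")
    return "\n".join(lines)

short_form = concord(RECORDS)
long_form = concord(RECORDS, with_context=True)
```

The only difference between the two listings is the string stored for each reference, so the choice between them is a matter of space and not of additional processing.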

Since our first concordance is considerably smaller than the original text, we are able to list other types of information. For example, Le Ver often includes etymologies in his entries, usually Latin or Greek. Since these are all unambiguously marked in the text, we can concord them separately, and study the use of etymology throughout the lexicon. Here is an extract from the concordance of etymons from the same M section:

cedo:

1 matricida
2 morticinus
3 morticinus (morticína)
4 muricida

centaurus:

1 monocentaurus

ceros:

1 monoceros

colera:

1 melan (melancolia)

In this case, the headword and sub-headword alone will often be all the context required, as with colera, in the melancolia sub-entry under the headword MELAN. The etymon concordance is very short, but it still presents a single type of information well.

Fortunately for us, Le Ver produced his lexicon in fairly good alphabetical order. However, while the headwords are fully ordered and there are many cross-references, it is sometimes difficult to find where a sub-headword is defined within a headword article. Again, we have marked the sub-headwords in the electronic text, so it is a simple matter to concord them. The final sample concordance is a list of sub-headwords with their corresponding headwords:

emembris:

1 membrum

emembro:

1 membrum

emendo:

1 mando
2 menda

emensus:

1 metior

emergo:

1 mergo

emeritio:

1 meritus

emeritus:

1 meritus

This concordance will be short, but it can be very useful, both for finding sub-headword articles within the lexicon and for studying the structure of the headword articles themselves.

The Dictionarius Le Ver project can also produce short concordances of Latin words, cross-references within the lexicon, cited forms and even marginalia, since we have marked all of these in our text. Rather than producing one large, awkward printed concordance with extensive context, we are concentrating on small, easy to use lists. Without the context, the user will have to make frequent reference to the text itself, but the printed (or microfiche) concordances will permit use of the text in many different ways, and in many different places.

We have generated these concordances using standard Unix shell tools, with all of the files in plain ASCII format. This is one of our best defenses against obsolete technology, since an ASCII text file is usually easy to convert to any format. The concordance files themselves are also plain ASCII, although for the sake of this paper we have converted them to proper foreign characters and added boldface and italics.

One day computers will be more standardised and more easily available. The Text Encoding Initiative (TEI), headed by Lou Burnard and Michael Sperberg-McQueen, is working to establish standards which will allow different computers to exchange all types of textual information. Once the new standards are in place and there are programs on the market using them, publishing an electronic interactive concordance will be simple and cheap. Until then, however, static concordances will remain the best option.

Throughout this part of the paper I have been careful to specify printed static concordances. It is also possible to release the text of static concordances in an electronic format, using plain text escape sequences for foreign characters like é. Users will be able to take advantage of their own software (for example, a wordprocessor with macros) to generate new types of concordances from it. There are already good distribution systems in place for electronic texts (as opposed to binary files), such as the Oxford Text Archive and the Usenet computer network. Perhaps this is the best compromise we can make for now -- releasing a static concordance, both as a printed text (on paper or microfiche) and as electronic text for further work by other computer-literate scholars.
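
One possible form of plain text escape sequence is the numeric character reference, in which é is written as `&#233;`. The sketch below, in Python, shows a round trip between accented text and a pure ASCII form; the choice of this particular scheme is an assumption made for illustration, since any unambiguous and documented convention would serve.

```python
import html

def to_ascii(text):
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

def from_ascii(text):
    """Restore the original characters from their references.

    Note: html.unescape also expands named references such as &amp;,
    so a literal ampersand in the source would need escaping first.
    """
    return html.unescape(text)

original = "niés par mentir"
escaped = to_ascii(original)     # "ni&#233;s par mentir"
restored = from_ascii(escaped)
```

A file escaped in this way contains nothing but ASCII and can therefore pass through any system that handles plain text, while remaining fully recoverable by the recipient.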


Notes

[1] Editorial note. Web formatting and display limitations lead us to prefer non-proportional characters in place of the original article's proportional characters.


Bibliography

  • MERRILEES, Brian (1988). "The Latin-French dictionarius of Firmin Le Ver (1420-1440)", Zürilex '86 Proceedings. Tübingen: A. Francke Verlag: 181-8.
  • MERRILEES, Brian (1990). "Prolegomena to a history of French lexicography: the development of the dictionary in Medieval France", Romance Languages Annual (Purdue University), 1: 285-91.
  • MERRILEES, Brian (1991). "Métalexicographie médiévale: la fonction de la métalangue dans un dictionnaire bilingue du moyen âge", Archivum latinitatis medii aevi: Le Bulletin Du Cange, 50: 33-70.
  • MERRILEES, Brian (1992). "The Organisation of the medieval dictionary", Romance Languages Annual, 3: 78-83.
  • MERRILEES, Brian & W. EDWARDS (1989). "Le statut du français dans le dictionarius de Firmin Le Ver", Le Moyen Français, 22: 37-51.