Introduction

DICO, developed at the Istituto per gli studi semantici e cognitivi (ISSCO, Geneva), is a network-based dictionary consultation tool whose purpose is to enable users on a computer network to consult monolingual or bilingual dictionaries.[1] Its main characteristics are that

  • it runs in a heterogeneous computer network;
  • it is simple to use (not only for trained users);
  • it displays the dictionary entries in a form similar to the printed dictionaries;
  • it provides more search possibilities than merely looking up the headword; and
  • it avoids multiple copies of the dictionary data.

1. Architecture

The system is divided into two distinct parts: (1) dictionary servers responsible for the database operations and (2) user interfaces, the clients of the server, which are responsible for transmitting queries from the user and for displaying answers. Clients and servers communicate over the network. The dictionaries reside on the server's machine, and consultation of the dictionary entries is done exclusively through the user interfaces.

1.1. The Dictionary Server

A dictionary server may serve multiple dictionaries and multiple clients. It gets queries from a client through the network, performs the requested operation, and sends the answer to the client. The two main operations are finding which dictionary entries correspond to a given search key, and retrieving the content of an entry. Additional operations include retrieving the next (or previous) entry in the dictionary sequence, retrieving information text, changing the search mode, changing the display format style, and retrieving various lists of what is available on the server (names of dictionaries, names of information texts, search modes, and names of display format styles).

For DICO, a dictionary is a collection of text entries with their associated headwords. Headwords constitute the primary access keys: they can be searched by exact or prefix match (both use binary search internally) or by regular-expression pattern (sequential search). In addition, entries can be searched through secondary access keys. For example, one can request all entries in a bilingual dictionary whose translation part contains a particular word. Secondary access is implemented with hash code tables.
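The three kinds of headword search can be illustrated with a minimal sketch. It is written in Python rather than in DICO's C, and the sample headwords and the function names (`exact_match`, `prefix_match`, `regex_match`) are assumptions made for illustration, not DICO's actual code.

```python
import bisect
import re

# Hypothetical sample data: the headword list is kept sorted so that
# binary search can be applied to it.
HEADWORDS = sorted(["mais", "maison", "mal", "marche", "mars"])

def exact_match(headwords, key):
    """Binary search for a headword equal to key."""
    i = bisect.bisect_left(headwords, key)
    if i < len(headwords) and headwords[i] == key:
        return [headwords[i]]
    return []

def prefix_match(headwords, prefix):
    """Binary search for the first candidate, then collect the
    contiguous run of headwords that share the prefix."""
    i = bisect.bisect_left(headwords, prefix)
    matches = []
    while i < len(headwords) and headwords[i].startswith(prefix):
        matches.append(headwords[i])
        i += 1
    return matches

def regex_match(headwords, pattern):
    """Sequential search: every headword is tested against the pattern."""
    compiled = re.compile(pattern)
    return [h for h in headwords if compiled.fullmatch(h)]
```

Prefix search can reuse the sorted order because all headwords sharing a prefix are adjacent in the list, whereas a regular expression gives no such guarantee and forces a scan of every headword.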

1.2. The User Interface

The user interface client shows on the user's screen what is available, gets input from the user to change settings or initiate queries, communicates with the dictionary server through the network, and displays the results of queries.

An answer to a query using a search key is a list of headwords, which are displayed on the screen. The user can then pick one headword and ask to see the associated entry. Other interactions include selecting settings through menus or switches, scrolling text and requesting help information.

When displaying an entry, the client has to format it according to the characteristics of the screen and the current formatting style. The text of the entry is first processed through a transducer (which is dictionary dependent) that produces elementary formatting instructions (which are screen independent) that are then interpreted by a formatter and displayed.
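The two-stage display pipeline might look as follows in a much simplified form. The storage layout `headword|grammar|translation`, the instruction names, and both function names are invented for this sketch; DICO's real transducer is grammar-driven and is not reproduced here.

```python
import textwrap

def transduce(source):
    """Dictionary-dependent step: turn a stored entry into
    screen-independent formatting instructions.
    The 'headword|grammar|translation' layout is a made-up format."""
    headword, grammar, translation = source.split("|")
    return [("bold", headword), ("italic", grammar), ("plain", translation)]

def format_for_terminal(instructions, width):
    """Screen-dependent step for a character terminal: with no fonts
    available, bold is rendered in upper case and italic between
    underscores; the result is wrapped to the screen width."""
    pieces = []
    for style, text in instructions:
        if style == "bold":
            pieces.append(text.upper())
        elif style == "italic":
            pieces.append("_" + text + "_")
        else:
            pieces.append(text)
    return textwrap.wrap(" ".join(pieces), width=width)
```

Because the instructions carry no screen details, a pixel-based client could interpret the same list with real fonts, and only the second function would differ.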

Two clients have been programmed: xDICO, for use under X-Windows (pixel-based), and tDICO, for use on standard terminals (character-based). While xDICO provides the possibility to interact with a mouse and to display text in various fonts, tDICO is a full screen interface (i.e., it is not line-oriented). Both can be run remotely, with the display on the local machine. For example, dictionary consultation is possible from a PC, provided a terminal connection can be established (via telnet or a serial line) to a machine which runs tDICO. See Figure 1 for a typical xDICO screen.

1.3. Client-server Communication

The client process initiates the dialogue by sending a connection request to a permanently running server process, which, after some checking, accepts the connection, spawns a child server process responsible for that client, and resumes waiting for connection requests from other clients. The client and the child server work in synchronized loops: the client interacts with the user, sends a query to the server, waits for the answer, and resumes interaction with the user. The server waits for a query from the client, processes the query, sends the answer, and resumes waiting.
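The dialogue described above can be sketched with ordinary sockets. This is an illustration under several assumptions: a thread stands in for the spawned child server process, the protocol is one line per query and one line per answer, and the names `ENTRIES`, `ALLOWED_HOSTS`, `serve_client`, `run_server` and `ask` are hypothetical. DICO's actual wire protocol is not described in this paper.

```python
import socket
import threading

ENTRIES = {"mais": "but", "maison": "house"}   # hypothetical dictionary data
ALLOWED_HOSTS = {"127.0.0.1"}                   # hypothetical authorized client hosts

def serve_client(conn):
    """Per-client loop: wait for a query, process it, send the answer,
    and resume waiting. The loop ends when the client disconnects."""
    with conn, conn.makefile("r", encoding="utf-8") as reader, \
            conn.makefile("w", encoding="utf-8") as writer:
        for line in reader:
            writer.write(ENTRIES.get(line.strip(), "") + "\n")
            writer.flush()

def run_server(listener):
    """Accept loop: check the origin of the request, hand the connection
    to a worker (a thread here, a child process in DICO), resume waiting."""
    while True:
        conn, addr = listener.accept()
        if addr[0] not in ALLOWED_HOSTS:
            conn.close()
            continue
        threading.Thread(target=serve_client, args=(conn,), daemon=True).start()

def ask(reader, writer, key):
    """Client side of one exchange: send a query, then block until
    the answer arrives."""
    writer.write(key + "\n")
    writer.flush()
    return reader.readline().rstrip("\n")
```

Since the client blocks in `ask` until the answer arrives and the worker blocks until the next query arrives, the two loops stay in step without any further synchronization.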

There may be many dictionary servers on a network. Clients need to know which dictionaries are available and where their servers are. They can request this information from an auxiliary program, the information server. So, initially, a client needs only to know how to contact the information server. There can be many information servers on the network and they do not need to run on the same machines as the dictionary servers.

2. Dictionary Representation

A dictionary on electronic media may be obtained in several ways: from a word-processing file, by optical character recognition, or by hand-typing a printed version. Even though the TEI (Text Encoding Initiative) will give guidelines on how to code printed dictionaries for exchange purposes, in most cases, parsing of a dictionary entry is difficult because

  • there are errors in the representation (typographical errors due to file corruption, etc.);
  • the structure of similar entries is not always consistent (there may have been different lexicographers at different times); and
  • there are few easy clues: you need language understanding to disambiguate some constructions.

DICO offers a number of special features that require some preparatory processing of the dictionary data but that provide both user-friendliness and flexibility of use.

2.1. Indexes

In DICO, a dictionary is a collection of textual entries. Each entry is associated with at least one main access key (the headword or headphrase of the entry) and possibly some secondary access keys (text).

A secondary index is an N-to-N mapping of access keys to entries. Index creation is done off-line, as it may involve external data and heavy processing. Search is done on the full access key (by hash code), and references to the entries are kept in lists or bitmaps. The keys are arbitrary, and a key does not have to appear in the content of an entry.

Secondary indexes include the following:

  • domain information codes (BIOLOGY, THEATRE);
  • reverse index: all words appearing in the definition or translation part of a word (possibly lemmatized); and
  • synonyms, or thesaurus.
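A secondary index of this kind can be sketched as a hash table from access keys to sets of entry references. Python's `dict` plays the role of the hash code table; the entry numbers and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def build_secondary_index(pairs):
    """Build the index once, off-line, from (access_key, entry_id) pairs.
    One key may point to many entries and one entry may be listed
    under many keys (N-to-N)."""
    index = defaultdict(set)
    for key, entry_id in pairs:
        index[key].add(entry_id)
    return dict(index)

def lookup(index, key):
    """Search by full access key: a single hash lookup."""
    return sorted(index.get(key, ()))

# Hypothetical domain-code index: entry 7 carries both codes.
DOMAIN_INDEX = build_secondary_index([
    ("BIOLOGY", 3), ("BIOLOGY", 7), ("THEATRE", 7), ("THEATRE", 12),
])
```

Nothing in the sketch requires the key to occur in the text of the entry, which is what allows external data such as domain codes or a thesaurus to be used as keys.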

This index design presents a number of advantages:

  • Simplicity. There are few constraints on the dictionary representation because the content of an entry is contiguous, all entries are in one file, in any order, and are not necessarily contiguous, and there are no significant limits (except those required by the hardware and software environment) to the size of an entry or the number of entries.
  • Expandability. New secondary indexes can be added easily.
  • Speed. The main index is in memory, sorted, and the secondary indexes are hashcoded. Therefore the real work is done off-line, and only once.

Approximate matching allows easy entry of search keys. When searching, the user may choose to ignore the case of letters (e.g., the key march would match both headwords March and march), or to consider only the base character of accented letters (e.g., the key eleve would match both headwords élève and élevé).
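The two approximate matching options can be illustrated by normalizing both the search key and the headwords before comparing them. The sketch below uses Unicode decomposition to isolate base characters and scans the headwords sequentially for brevity; the function names are hypothetical, and how DICO itself normalizes keys is not specified here.

```python
import unicodedata

def strip_accents(text):
    """Keep only base characters: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(key, ignore_case=False, ignore_accents=False):
    """Apply the matching options chosen by the user."""
    if ignore_accents:
        key = strip_accents(key)
    if ignore_case:
        key = key.casefold()
    return key

def matches(headwords, key, **options):
    """Return the headwords equal to key under the chosen options."""
    target = normalize(key, **options)
    return [h for h in headwords if normalize(h, **options) == target]
```

For instance, `matches(["March", "march"], "march", ignore_case=True)` returns both headwords, while the same call without the option returns only `march`.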

2.2. Display Format

Entries are dynamically formatted on the screen, according to the current dictionary, and the chosen display format and screen width. For each dictionary a set of formatting styles is provided. These are instructions (grammars) for the transducer (which is a push-down automaton) that can be loaded from the server when selected by the user. There are at least two formatting styles: source style, which shows the data as it is stored, and nice style, which uses indentation and font changes to show an entry in an easy-to-read layout. Other formatting styles (such as sgml, which displays the SGML version of an SGML-coded dictionary entry) may also be implemented, according to the needs and requirements of the user. For an example of source and nice representations, see Figure 2.

2.3. Special Characters

Currently DICO supports the ISO Latin-1 alphabet and the IPA phonetic alphabet. Sometimes the user's equipment or inexperience does not allow typing or displaying special characters. Moreover, in a heterogeneous environment it is not always easy or possible to input or output accented letters. Therefore the user interface offers special modes either to type or to display accented letters as pairs: for example, "a`" stands for à.
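A pair mode of this kind could work along the following lines. Only the grave-accent pair comes from the text above; the other three marks in `ACCENT_MARKS`, and both function names, are assumptions added for illustration.

```python
import unicodedata

# Mapping from an ASCII mark to a Unicode combining accent.
# Only "`" is attested in the text; the other marks are assumed.
ACCENT_MARKS = {"`": "\u0300", "'": "\u0301", "^": "\u0302", '"': "\u0308"}

def pairs_to_accented(text):
    """Input mode: a letter followed by a mark becomes an accented letter."""
    out = []
    for ch in text:
        if ch in ACCENT_MARKS and out and out[-1].isalpha():
            out[-1] = unicodedata.normalize("NFC", out[-1] + ACCENT_MARKS[ch])
        else:
            out.append(ch)
    return "".join(out)

def accented_to_pairs(text):
    """Display mode: an accented letter is shown as letter plus mark."""
    reverse = {accent: mark for mark, accent in ACCENT_MARKS.items()}
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(reverse.get(ch, ch) for ch in decomposed)
```

Using the apostrophe for the acute accent makes a genuine apostrophe after a letter ambiguous, so a real input mode would need an escape convention or a different mark.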

2.4. Alphabetical Order

Alphabetical order is apparent when displaying a list of matching headwords and when asking for the next or previous entry. Sorting rules are specific to each language:

  • different alphabets: in Danish æ, ø, and å are sorted after z;
  • grouped letters: in Spanish ch is sorted after cz, and ll after lz;
  • secondary sort: in French the accent on a character is used as a secondary key.

For example,

WORD        PRIMARY INDEX     SECONDARY INDEX      (TRANSLATION)

mais        mais              i                    (but)
maïs        mais              ï                    (corn)
maison      maison            i                    (house)

In DICO a description of the alphabetical order has to be associated with each dictionary.
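Language-specific ordering of this kind can be expressed as a sort key per language. The three functions below are simplified sketches with hypothetical names: the French key ignores the finer rules of French accent ordering, and the Spanish and Danish keys handle only the letters mentioned above. They do not reproduce DICO's own description format.

```python
import unicodedata

def french_sort_key(word):
    """Primary key: the word without accents.
    Secondary key: the accented form, used only to break ties."""
    decomposed = unicodedata.normalize("NFD", word)
    primary = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (primary, word)

def spanish_sort_key(word):
    """Grouped letters: 'ch' sorts after 'cz' and 'll' after 'lz'.
    The high code point U+FFFF pushes the digraph past any second letter."""
    return word.replace("ch", "c\uffff").replace("ll", "l\uffff")

# The code points just after 'z' place these letters at the end, in order.
DANISH_TABLE = str.maketrans({"æ": "{", "ø": "|", "å": "}"})

def danish_sort_key(word):
    """Different alphabet: æ, ø and å sort after z."""
    return word.translate(DANISH_TABLE)
```

With `sorted(["maison", "maïs", "mais"], key=french_sort_key)` the result follows the order of the table above, whereas a plain code point sort would place maison before maïs.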

2.5. Protection

A lot of attention has been paid to protecting copyrighted dictionaries: in particular, the dictionary source is encrypted and resides on only one machine, which can be protected. Entries transmitted through the network are further encrypted (client-dependent). The server checks the origin of connection requests from clients and accepts only authorized ones.

3. Implementation

DICO is programmed in C, using widely available libraries. It runs under the Unix operating system (SunOS 4.1, X-Windows X11R5 for xDICO). A prototype implementation has been installed and made available on the Sun computers of the network of the University of Geneva since June 1991. Currently nine dictionaries are available for consultation (all copyrighted): six pocket-sized bilingual dictionaries from Harper-Collins (French-German, German-French, French-English, English-French, German-English, English-German), the two volumes of the Oxford Dictionary of Current Idiomatic English (verbs and phrases), and the Oxford Advanced Learner's Dictionary of Current English.

The University of Toronto is also a beta-test site for DICO.

4. Using DICO with Historical Dictionaries

While DICO has so far been used only with modern dictionaries, it may of course be used as a tool for consulting any type of dictionary, for any language and from any period. Dictionaries do not have to be formatted in any special way. However, in order to include early dictionary texts in DICO with the purpose of getting interesting information out of them in later work, it makes sense to first tag them in a sensible way.

Several pages of the French-English Cotgrave dictionary (in fact the whole A section) have already been included in DICO.[2] For convenience, and to follow the current trend for standard textual representation, the entries have been coded with a suitable SGML DTD. Unsurprisingly this work has revealed a number of inconsistencies in the original lexical entries. However, most of these should be ironed out with the help of Unix utilities and with an SGML parser.

When used with bilingual dictionaries, in which headwords are described in terms of equivalences in other languages, DICO allows users either to search for headwords (in one language) or to invert the entries, that is, to reverse their directionality. This can be done by using elements of the translation field in the lexical entries as keys for a secondary index. We can then inquire about, and retrieve, any target-language word hidden in the translation given in the lexical entry of the headwords.

While accessing the French words in the Cotgrave (the headwords of the lexical entries) only requires building the primary index based on the list of the headwords, accessing English words (which are the targets of the entries) must be done through a secondary index. Building this secondary index requires coping with the irregular spelling of the English words, and thus developing an index in which all the variant forms of an English word may be retrieved by querying on the lemma.
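Such a lemma-based secondary index might be built as sketched below. The entries and the variant-to-lemma table are invented for illustration and are not quoted from Cotgrave; in practice the table would be the product of the lemmatization work itself.

```python
import re
from collections import defaultdict

# Invented sample data: French headword -> English translation text.
ENTRIES = {
    "chien": "a dogge",
    "chenil": "a kennel for dogs",
    "bas": "lowe, base",
}

# Hypothetical table mapping variant spellings and inflected forms to a lemma.
LEMMAS = {"dogge": "dog", "dogs": "dog", "lowe": "low"}

def build_reverse_index(entries, lemmas):
    """Index each headword under the lemma of every word found
    in its translation text."""
    index = defaultdict(set)
    for headword, translation in entries.items():
        for token in re.findall(r"[a-z]+", translation.lower()):
            index[lemmas.get(token, token)].add(headword)
    return index

def headwords_for(index, lemma):
    """Query on the lemma to retrieve entries using any variant form."""
    return sorted(index.get(lemma, ()))
```

A query on the lemma `dog` then retrieves both `chenil` and `chien`, although neither translation contains that exact spelling.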

Lemmatizing these texts is a sizable task. Nevertheless, it would allow research on early English lexis and syntax as documented in early bilingual dictionaries, even those in which all headwords are in the other language. Many early dictionaries are alphabetized by the foreign-language word, so that the rich English equivalents, quotations, and notes following them are very hard to find in a paper book, but an on-line consultation tool working on an electronic version would allow us to retrieve this information by indexing the entries with the English words.


Notes

[1] Part of this description of DICO is adapted from the text for another presentation of the tool, done by Dominique Petitpierre and Gilbert Robert, the developers of the tool at ISSCO (Petitpierre & Robert 1992).

[2] Ian Lancashire of the University of Toronto has provided this text.


Bibliography

  • PETITPIERRE, Dominique & Gilbert ROBERT (1992). "DICO, A network based dictionary consultation tool", Meeting of the Swiss Group for Artificial Intelligence and Cognitive Science (SGAICO 1992). Neuchâtel, Switzerland.