What follows is a general description of the conception and development of SAToRBase, an online database tool that contains some 3000 occurrences of topoi (recurring motifs) in French literature from about 1200 to 1800. SAToRBase has been developed expressly for the Société d'Analyse de la Topique Romanesque (SAToR), an international literary association. Its current development is led by SAToRCanada, an affiliate of SAToR. The purpose of this article is not to examine in detail each aspect of SAToRBase; that would be futile for an online tool of this kind, which is in constant evolution. Rather, the purpose is to present more broadly the major characteristics of SAToRBase, including a brief description of typical procedures such as searching for occurrences of topoi and proposing new occurrences. The interface of SAToRBase is likely to change considerably over time, but the core concepts will no doubt remain stable.
Before considering SAToRBase in more depth, I would like to recall some significant moments in the efforts of SAToR to integrate the computer into its research (see Ramos 2003 for a much more comprehensive analysis). Despite early efforts to make use of existing tools (like TACT for lexicometry), SAToR recognized very quickly the need to develop a specialized application for the accumulation, management, research, and delivery of its data. In 1989 a research group from Montpellier (France) unveiled a prototype of what would soon after become TopoSAToR, the first application conceived of and designed specifically for SAToR. TopoSAToR underwent several iterations of development, sometimes motivated by a desire to improve the application, sometimes by external forces such as changes in operating systems.
The relationship between SAToR and the computer has remained tenuous throughout its history, despite clear successes. SAToRians, for the most part, recognise the potential of the computer as a tool, but many remain leery of its possible dangers. One of the founders of SAToR expresses this mixed perspective as follows:
We must master this instrument, with the help and skills of our computing colleagues, but we must not let the instrument master us: the danger lies in the possibility that our research objectives be, in advance, informed by the computer, that logical frameworks, categories of invention in novels be imposed on our texts, where henceforth we would risk finding only what the computer would have led us to find, abstract schemas and mechanical combinations. (Coulet 31; my translation)
Programmers and literary scholars alike do well to heed these warnings, which resonate with an almost timeless truth. Indeed, the more powerful and agile the tool, the more crucial it is to be wary of its effects on literary scholarship.
TopoSAToR is certainly a powerful and flexible tool, but it was designed for the stand-alone personal computer. Every user must obtain and install a copy of the program and work with the data that accompanied the installation files (or obtain new data in the same way as the program itself). This model of distribution may work well when the value of a program resides in its functionality (as with Corel WordPerfect or Adobe Photoshop). However, the value of the SAToR Project does not reside in the functionality of the program (whatever it may be), but rather in the dynamic data to which the program provides access. It is thus critical that 1) the data be as current as possible and 2) the tool for the exploitation of the data be as user-friendly as possible; these are two principles that fundamentally guided the development of SAToRBase.
Whereas TopoSAToR was designed for the stand-alone personal computer, SAToRBase is designed for the networked computer. SAToRBase implements techniques and technologies that simply were not available when the development of TopoSAToR began. Since the program and data are stored on a web server, the user has access to the most recent available version of the program and the most current data. As such, the personal computer is merely a window to data that is managed centrally on the server (the technical details of the system are described in section 3).
The home page of SAToRBase (Figure 1) presents an entry-way that is both reassuring in its simplicity and rich in the pathways that it offers. One finds a brief welcome message that explains the purpose of the site (especially useful for visitors who have arrived somewhat serendipitously, as through a web search engine). There is also a link to a Frequently Asked Questions (FAQ) page that provides relevant information about the site and its use; in other words, no skills are taken for granted. In the upper right-hand corner of the home page (and on every page in the site) is a navigation bar that includes a link to the FAQ and a link to the contacts page (site administrator, SAToR's council, etc.). At the bottom of the page one finds links to the pages of other research teams of SAToR (currently the “Centre SAToR France” and the “Women's Writers” team in Holland). Finally, the home page includes a link to log in to the supplementary features of the site (the different levels of access will be explained below), as well as several links to navigate within the categories and the topoi and to execute simple and advanced searches.
When I first agreed to create an online successor to TopoSAToR, I received an enormous flat file that contained most of the relevant raw data. I had not used TopoSAToR and, in fact, I knew very little about the SAToR association. One might think that such ignorance would constitute a significant handicap, but on the contrary, it was precisely this tabula rasa that allowed me to conceive of and design a new tool that was not simply an online implementation of what already existed in stand-alone form.
My ignorant and naïve perspective on the raw data immediately led me to envisage a hierarchy for navigation of the nearly 1000 topoi enumerated (something that did not exist in TopoSAToR). A long sequential list of topoi may work well for scholars already familiar with the data (and SAToRBase also offers such a list), but I very much wished to make the tool accessible to neophytes as well. Figure 2 shows an example of navigation by categories.
The categories chosen have an entirely functional status, much like the categories of many web search engines. One does not use, say, Yahoo! to find its categories; the categories merely facilitate the searching and navigation of the indexed sites. Similarly, the categories in SAToRBase are like road signs that lead the user toward a destination (nevertheless, if in transit a sign grabs the attention of the user and leads to an unexpected treasure, all the better; serendipity is an aspect of scholarship that needs to be encouraged).
The search mechanisms of SAToRBase are another area where user-friendliness and power are given equal priority. By default, the user sees a short HTML form to execute a Simple Search, but one can also easily switch to an Advanced Search. The Simple Search (Figure 3) allows the user to limit the request to words contained in the categories and topoi names and/or to perform a full text search of the occurrences (the anatomy of the occurrence will be discussed below in section 2.3).
Search words are highlighted in the results, and clicking on any of the words in the results causes a search on that word to be performed. In addition, a link is provided to an electronic version of the full text that contains the occurrence, when that text is available online (see the liens link in Figure 3). One can also work with the full text in HyperPo, an online analysis and exploration tool (see Sinclair 2003b for more information about HyperPo).
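The highlighting of search words described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not taken from the Perl code of SAToRBase; the function name `highlight` and the choice of `<b>` tags are assumptions.

```python
import html
import re

def highlight(text, terms):
    """Return HTML in which whole-word matches of the terms are wrapped in <b>.

    Hypothetical helper: matching is case-insensitive and the rest of the
    text is HTML-escaped so that it can be placed safely in a results page.
    """
    if not terms:
        return html.escape(text)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    # With one capturing group, split() alternates non-matches and matches,
    # so the pieces at odd positions are the search words that were found.
    pieces = pattern.split(text)
    out = []
    for i, piece in enumerate(pieces):
        escaped = html.escape(piece)
        out.append(f"<b>{escaped}</b>" if i % 2 else escaped)
    return "".join(out)
```

Escaping each piece after splitting, rather than escaping the whole text first, avoids accidentally matching a search word inside an HTML entity.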
Though a simple search should suffice in most cases, there are times when a more nuanced search may be desirable, or where one may want to modify the default format of the output. Figure 4 shows the level of detail that is available in an Advanced Search.
The user first specifies terms for which to search. One can search for whole words or partial words (for instance, the pattern “possible” would also match “impossible”). The three text boxes provide basic Boolean capabilities: 1) terms that must appear (and); 2) terms that may appear (or); and 3) terms that must not appear (not).
Next, the user determines in which data to look for the terms. Moreover, only the fields that are searched are displayed. As with the Simple Search, the user can choose to perform the search within only the categories and topoi (the “branches” option toggles whether or not to display the full category hierarchy), and/or search within the full text of the occurrences. However, unlike with the Simple Search, the user can now specify in which fields to search for the terms (or which fields to display in the results, an easy way to customise the output). In addition, the user can ask to search for the terms within the bibliographic references of the occurrences (and determine whether or not to display the occurrences linked to each reference).
Finally, the Advanced Search permits the user to determine the format of the results. The default format shows the results in HTML, which is easy to read but relatively complex (nested tables, intense hyperlinking, etc.); this is ideal for most cases. The user can instead choose a simple HTML format that still looks clean and that may be easier to print or to cut-and-paste into a document or email message. The plain text format is even simpler and can be useful for generating data to be converted to other formats, since there is a minimum of codes to translate. The last choice is XML, which can be used to transfer data into other structured data formats with specialised tools. The XML format will likely be of little interest to most users, but it adds to the interoperability of SAToRBase, that is, its capacity to communicate with other online resources. Just as SAToRBase makes use of other tools available on the web (electronic texts, search engines, HyperPo), external resources should be able to make use of the data contained in SAToRBase; such is the duty of a responsible networked tool.
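The logic of the three text boxes can be sketched as follows. This is a Python illustration under stated assumptions, not the Perl implementation of SAToRBase: the function name `matches` and its parameters are hypothetical, and I assume that when “or” terms are supplied at least one of them must appear.

```python
import re

def matches(text, must=(), may=(), must_not=(), whole_words=True):
    """Return True if the text satisfies the and / or / not term lists."""

    def found(term):
        pattern = re.escape(term)
        if whole_words:
            # Whole-word mode: "possible" does not match "impossible".
            pattern = r"\b" + pattern + r"\b"
        return re.search(pattern, text, re.IGNORECASE) is not None

    if any(found(t) for t in must_not):      # "not" box
        return False
    if not all(found(t) for t in must):      # "and" box
        return False
    # "or" box (assumption): if any terms are given, one of them must appear.
    if may and not any(found(t) for t in may):
        return False
    return True
```

For example, `matches("rien n'est impossible", must=["possible"], whole_words=False)` succeeds because partial words are allowed, whereas the same call with `whole_words=True` fails.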
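To give a sense of what the XML output makes possible, here is a minimal sketch of serialising one occurrence. It is written in Python for illustration only; the element names (`occurrence`, `topos`, `source`, `text`) and the `status` attribute are assumptions and do not reproduce the actual XML vocabulary of SAToRBase.

```python
import xml.etree.ElementTree as ET

def occurrence_to_xml(occ):
    """Serialise a dictionary describing an occurrence as an XML string.

    The field names are hypothetical; special characters in the values
    are escaped by ElementTree.
    """
    root = ET.Element("occurrence", status=occ["status"])
    for name in ("topos", "source", "text"):
        ET.SubElement(root, name).text = occ[name]
    return ET.tostring(root, encoding="unicode")
```

An external resource could then read such a record back with any standard XML parser, for instance `ET.fromstring(...)`, without needing to know anything about the HTML presentation.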
The core of SAToRBase is the occurrence of a topos, regardless of whether one arrives at it by navigating through the categories or by performing a simple or advanced search. The display of an occurrence (Figure 5) includes the following components:
Figure 6 shows an example of an occurrence such as it would be seen if one has not logged on to SAToRBase (only current members of SAToR can log on). Some functions are only visible to members (Figure 5), and only certain members with additional privileges see certain functions (for managing users, for instance). These variations demonstrate an important design principle of SAToRBase: keep the interface consistent and only display to the user what is relevant.
SAToRBase has the following levels of access, in increasing order of privileges:
Members of SAToR who have logged on will see a button labelled Nouvelle Occurrence (New Occurrence) in the topos strip (Figure 5); this button leads to an HTML form similar to the one in Figure 7. The input fields are as follows:
Once the update button (“Mettre à jour la base de données”) is clicked, the submitted occurrence is integrated immediately into the database with the status “proposed”. An email message is simultaneously generated by SAToRBase and sent to the members of the accreditation committee (who then take about two weeks to deliberate, modify and approve the submission before its status changes to “accredited”).
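The submission workflow just described (immediate integration as “proposed”, notification of the committee, later accreditation) can be summarised in a short sketch. It is written in Python for illustration and is not the code of SAToRBase; the class `Occurrence`, the functions `submit` and `accredit`, and the `notify` callback standing in for the email step are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Occurrence:
    topos: str
    text: str
    status: str = "proposed"   # every new submission starts as "proposed"

def submit(database, occurrence, committee, notify):
    """Store the occurrence immediately and notify each committee member."""
    database.append(occurrence)
    for address in committee:
        # notify() is a stand-in for sending an email message.
        notify(address, f"New proposed occurrence for topos: {occurrence.topos}")
    return occurrence

def accredit(occurrence, corrected_text=None):
    """Mark a proposed occurrence as accredited, optionally after correction."""
    if occurrence.status != "proposed":
        raise ValueError("only proposed occurrences can be accredited")
    if corrected_text is not None:
        occurrence.text = corrected_text
    occurrence.status = "accredited"
    return occurrence
```

Passing `notify` as a parameter keeps the sketch independent of any particular mail system.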
This section will provide a brief overview of the technical specification of the SAToRBase system.
SAToRBase is a server-side program written in Perl, containing approximately 2000 lines of programming code (given the variability of programming styles, this line count means very little).
One significant advantage of Perl as a programming language is its general portability; programs can usually be run on various platforms (Solaris, Linux, AIX, Mac OS X, Microsoft Windows, etc.), with few or no modifications. Moreover, SAToRBase only uses modules that are included with standard Perl distributions (CGI.pm, SDBM_File.pm, POSIX.pm, locale.pm and their dependent modules). It should be relatively trivial to take an “image” of SAToRBase and install it on any machine with a web server. At the time of writing this article, SAToRBase runs on a modest Sun Workstation (UltraSparc 5 at 333 MHz and 128 MB RAM), though it is expected that it will migrate to a high-end TAPoR (Text Analysis Portal for Research) server once it is installed.
The code of SAToRBase is copyrighted by the author and SAToRCanada and licensed under the GNU GPL (GNU General Public Licence). The data is copyrighted by SAToR and is not available for distribution.
SAToRBase currently relies on the Simple Database Manager (SDBM) to store its data (SDBM has certain disadvantages but is one of the most universally available database formats in Perl). The data fill about 20 MB of space in the SDBM format (about 2.5 MB in plain text format). SAToRBase currently contains about 2900 occurrences from 410 sources, organised into 1030 topoi.
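SDBM is a simple key-value store: each record is saved as a value under a unique key. The following sketch shows the general idea in Python, using the standard `dbm.dumb` module as a stand-in for SDBM (Python does not provide SDBM itself). The key format, the JSON encoding and the field names are assumptions for illustration and say nothing about how SAToRBase actually lays out its records.

```python
import dbm.dumb
import json

def save_record(path, key, record):
    """Store one record (a dictionary) under the given key."""
    with dbm.dumb.open(path, "c") as db:     # "c": create the file if needed
        db[key] = json.dumps(record, ensure_ascii=False).encode("utf-8")

def load_record(path, key):
    """Retrieve and decode the record stored under the given key."""
    with dbm.dumb.open(path, "r") as db:     # "r": read-only access
        return json.loads(db[key].decode("utf-8"))
```

A store of this kind offers lookup by key only, which is one reason why searching requires the program to scan or index the records itself.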
SAToRCanada will continue to develop SAToRBase, and focus in particular on the following improvements:
- user customisation
  - personalised history (searches, navigation, etc.)
  - personalised categorisation of topoi
- more intelligent searches
  - orthographic variations (especially between “old” and “new” spellings)
  - morphological lemmatisation (searching for verb forms together, etc.)
  - semantic clustering (searching for clusters of meaning)
- integration with external resources
  - discussion forum for SAToRians
  - links to more electronic texts
  - specialised access to search engines
  - integration with TAPoR and other analysis tools
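One of the improvements listed above, searching across orthographic variations, could be approached by expanding a search term into its known spellings before the search is run. The sketch below is a Python illustration of that idea only; it does not describe a planned design for SAToRBase, and the small variant table and the function name `expand` are assumptions.

```python
# Hypothetical table: modern spelling -> older spellings of the same word.
VARIANTS = {
    "avait": {"avoit"},
    "roi": {"roy"},
    "était": {"estoit"},
}

def expand(term):
    """Return the term together with its known old and new spellings."""
    term = term.lower()
    forms = {term}
    for modern, old_forms in VARIANTS.items():
        if term == modern or term in old_forms:
            forms |= {modern} | old_forms
    return forms
```

A search for “roy” would then be run as a search for either “roy” or “roi”. A table of this kind would have to be compiled by hand or derived from rules, which is part of what makes the improvement non-trivial.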
Even though SAToRBase, as an online collaborative tool, offers distinct advantages over its stand-alone predecessor TopoSAToR, the underlying database is still insufficiently developed to provide either a global picture of topoi in French literature (one of the original objectives of SAToR) or a rich enough variety of examples of any one topos. The future of SAToRBase will depend first and foremost on the will of SAToRians to continue contributing to a shared resource. Any improvement to the interface and functioning of SAToRBase will be essentially futile without the addition of significant amounts of data.
In order to significantly increase the number of occurrences in the database, SAToRCanada has recently begun exploring the possibility of using the computer to identify occurrences semi-automatically. Theoretically, the computer would identify “hot-spots” of a text that may be occurrences of a topos by analysing it for a variety of attributes (ontology of nouns, semantic traits, verb tenses, etc.). The human would then be able to confirm (or reject) the existence of a proposed occurrence by selecting one of the suggested topoi from a list. The challenges in such an approach are numerous (syntactic parsing, semantic disambiguation, etc.), but the potential pay-off is enticing.
SAToRCanada is currently supported by a Regular Research Grant from the Social Sciences and Humanities Research Council of Canada.
The impetus for this article is, in part, to present some of the computing activities of SAToR, a primarily Francophone literary research group, to the wider humanities computing community. The description of SAToRBase seems particularly timely as recent articles in Humanities Computing journals have made reference to SAToR; see for instance Winder and Bradley. This article is an adaptation of a version in French (see Sinclair 2003a).
Some of these fields are explained in more detail in Section 2.4 “Submitting a New Occurrence”.
The status of an occurrence reflects the complex history of SAToR. A mechanism was needed to distinguish data that had been imported from TopoSAToR, but that were in various stages of development, from the new, clean data that were being added to SAToRBase. The data from TopoSAToR are in fact rather uneven and it would require a considerable investment of time and effort to make them consistent. In any case, occurrences flagged as accredited, be they older ones from TopoSAToR or newer ones entered directly into SAToRBase, have been corrected and approved by a sub-committee of SAToR. Once an occurrence has been submitted by a researcher (Figure 7), the normal turnaround time for accreditation is about two weeks. Occurrences are often modified by the committee but almost never rejected outright.
This minimalist principle could be called WYDSYDG (What You Don't See You Don't Get); my thanks to Barbara Bond for suggesting the term.
The terms “co-text” and “context”, defined here, are admittedly used somewhat idiosyncratically by SAToR.