This paper examines the planning stages of a large research database of historical materials, in particular, the task of determining the data structure, which will fundamentally determine the functionality of the resource when it is finished. The proposition is simple: before we begin imposing a structure on our material of study, we need to discover–in the Latinate sense of "uncovering"–the shape of the data as it presents itself in these materials. The imperative is stronger the more heterogeneous and idiosyncratic the material under consideration. Despite expressed doubts about the adequacy of the TEI (Text Encoding Initiative, http://www.tei-c.org/index.xml) to account for the relatively familiar object of the book, one can largely assume that a TEI-based schema will in most cases be sufficient for describing what one sees on the page of a book or manuscript. When one attempts to describe a miscellaneous collection of objects, such as those under consideration here, an adequate ontology becomes much more difficult to predict.
My argument picks up on Patricia Bart's discussion of what she calls "inductive" or "experimental" markup. In response to the inability of the TEI to deal with some features of an eccentric medieval manuscript of Piers Plowman (Huntington MS Hm 114) she was encoding, Bart proposed a set of TEI elements and attributes that would be explicitly experimental and used to create temporary solutions to cases and situations for which no application of the TEI guidelines currently exists. The process described below takes a somewhat different approach in the same vein: it is an experiment in pre-coding investigation, what I call "investigative tagging." For this experiment I devised a means for interrogating textual materials related to early modern cabinets of curiosities in order to determine how the object-world of these collections, as represented in diverse data types in the historical record, might be best described. The experiment presented a selection of scholars with a passage from a seventeenth-century catalogue of curiosities and asked them to represent in descriptive terms what they saw in these documentary representations of objects. The desired outcome was a clear indication of the sort of schema that would be required in tagging these objects so as to account for their salient material and circumstantial qualities.
The larger project: the Digital Ark
The database under consideration is The Digital Ark, a research database of representations of cabinets of curiosities in England and Scotland from 1580 to roughly 1700. These collections, the precursors of the modern museum, were heterogeneous, typically including a variety of strange and unusual objects, whether natural or artificial, domestic or foreign. They reflected interests in antiquity, natural history, ethnography, aesthetics, and more. They might include strange creatures pulled by fishermen from the sea, cultural artefacts brought by merchant explorers from the New World, oddities of human anatomy, or remains recovered from an Anglo-Saxon burial mound: anything that someone might find curious and interesting. The early modern cabinet of curiosities was thus a microcosm of a world that was continually revealing new and startling objects for consideration. Newly discovered specimens and objects from far-flung regions of the world, together with more local anomalies and peculiarities, were at the heart of a rapidly revising ontology. Thus, at this time of rapid geo-political expansion, the received ontology was being radically challenged: what sort of things exist in the world was no longer certain or clear (see Impey and MacGregor 1985; Pomian 1990; Daston and Park 1998; Zytaruk 2011; and Moncond'huy 2013). At the same time, new ways of thinking were influencing this process of ontological revisioning. I tentatively and heuristically refer to this process as taxonomy, not the full-grown method perfected by Linnaeus, but, more generally, a process of placing things into categories of existence based on relationships inferred through the accumulation of knowledge: in essence, Francis Bacon's method of induction as set down in his Novum Organum (1620).
Taxonomy in this context is a reworking of object classifications based on relationships discovered by empirical observation: thus the importance of the cabinet of curiosities as an observatory that presents a model of this world of objects in their classifications and relations.
The cabinet of curiosities was thus an important site in the development of the science of classification (Burke 2000, 106-9). Consider for example the unicorn. Based on observation and accumulation of material evidence, there was in the seventeenth century a new imperative to reconsider the ontological status of this creature. This process of reconsideration involved rudimentary taxonomy, beginning with a distinction between land unicorn and sea unicorn based on the growing realisation that those long, spiralling horns seemed always to be found by the sea, and, as it turned out, came from a very strange-looking "fish," i.e. whale. Further clarification of thought resulted from further distinctions within the land-unicorn category. There were a few unicorns to consider: the rhinoceros, for example, as distinct from that white horse-like animal that could only be caught by a virgin. This being the case, one might also place the Naso brevirostris in the category of "sea unicorn" (its common name today is indeed the "spotted unicorn fish"). Nehemiah Grew depicted this fish in his catalogue of the Royal Society's museum, where he also listed "the horn of a unicorn bird" (the Anhima cornuta of Brasil). The result of this taxonomy, an exercise in discovering new categories and relations between things, was a new and tentative ontology, an evolving model of what exists in the world and how it all relates.
So far, I have been discussing ontology in the philosophical-scientific sense, but the cabinet of curiosities can also be thought of as an ontology in the computational sense, as a representation of an objective domain: in this case, the entire known world of the early modern period. Thus, in the Digital Ark, what I am seeking to do is to represent the ontology of extant representations of the world as it was experienced in the sixteenth and seventeenth centuries. The primary material of this database consists of early modern documents that represent these collections of curiosities: catalogues, inventories, descriptions in travellers' diaries and correspondence. What is required, then, is to produce an ontology in order to represent the documents that represent the cabinets of curiosities that were themselves representations of the world. From the computational perspective, then, there are three layers of ontology, if we take an ontology to be a structured representation of a world of objects: the representation of an increasingly strange, rapidly expanding world in the microcosm of the seventeenth-century collection of curiosities; the representation of this microcosm, and the world it references, in the historical documents of the period; and the computer model devised to represent these documentary representations. The resulting model should function both as "a representation of something for the purpose of study," and as "a design for realizing something new," that is, as a means for creating new representations of the early modern collection of curiosities (McCarty 2005, 24). In terms of informatics, the ontology provides a "conceptual model" or "content theory" that helps to clarify the structure of knowledge presented in these documents (Chandrasekaran et al. 1999, 20 and passim).
The important point here is the distinction between process and product. In the new regime of learning in the seventeenth century, taxonomy became a method for investigating and interrogating the world of being; but in the current discourse of digital humanities, ontology and taxonomy are often collapsed and the nature of the object of study effectively predetermined by a premature adoption of metadata standards and controlled vocabularies. It is important to make a clear distinction between these two stages of data processing and to give the investigative process due consideration. In the present case, to arrive at the end product that this project envisions, I want to respect the exploratory, tentative, and probative method set down by Bacon and employed (to widely varying degrees) in the culture of curiosity in the seventeenth century: that is, to undertake a process of taxonomic exploration toward building an ontology of these texts.
Background and development of the Investigative Tagging exercise
The idea of "investigative tagging" began with a computer program called Glyphicus developed by Jeff Smith at the University of Saskatchewan as a tool for tagging the texts using a relational database. The database Glyphicus (which is no longer being developed) allows ad hoc tagging with overlapping hierarchies, which, given the heterogeneous nature of these historical materials, seemed like a nice combination of functionality. It quickly became apparent, however, that Glyphicus was perhaps too unstructured for processing these texts and that it might better serve a preparatory step of playing or experimenting with the tagging of texts. From this process of experimental play it occurred to us that it would be good practice to find what is there in the material before presuming to describe it, thus to avoid what the same Jeff Smith calls the fallacy of prescience. In this way, the tool in essence presented the opportunity to think differently about workflow and to question assumptions about the future data model of the Digital Ark.
Existing metadata standards for material objects (such as CIDOC) come from the museum domain and thus focus on the objects themselves and their lives in the museum (On the importance of classification standards, and the challenge posed in establishing and implementing them, see Bowker and Star 1999, 9-14; and Lampland and Star 2009, 3, 14-15, 22). Although some of the objects of early modern collections are extant, the bulk of my data are in the form of historical documents describing collected objects and their seventeenth-century context. Moreover, the concerns and interests of museum practitioners and humanities scholars of historical materials are very different. Scholars in the humanities, for example, are as concerned with the discourse involving objects and their context as they are with the objects themselves. It was clear, therefore, that the Digital Ark could not be based solely on existing metadata standards. We needed to make some attempt at thinking through the material for ourselves and to come up with our own solutions for representing it, and then see how well the results would square with existing standards. Or stated another way, we wanted to avoid the fallacy of prescience, of presuming to know in advance the shape of the data (Bart 2006 § 11, § 14). To presume that one's material will neatly fall into line with the available standards is to limit, in some sense, what can be seen (Lampland and Star 2009, 14). As Johanna Drucker puts it in the context of what she calls "speculative computing," "Once determined, a data structure or content model becomes a template restricting interpretation" (2004). At the same time, investigative tagging functions as what Geoffrey Bowker (1994) calls "infrastructure inversion," bringing to the fore the decisions, assumptions, and choices behind an adopted classification scheme (Lampland and Star 2009, 17, 21; Van House 2003, 55).
A significant challenge for the Digital Ark is that it consists mostly of early modern descriptions and depictions of now-lost objects and, in far fewer cases, of modern curatorial data of extant objects: it deals at once in both a textual and an extra-textual world. A similar problem was faced by The Museum Project for Norwegian Universities in attempting to reconcile document-based information about museum objects (legacy catalogues of Norwegian museums) with currently available metadata standards for describing collected objects (Jordal et al. 2004). Reconciling legacy records with modern standards is a challenge in cases where the records were produced by curators (where there is a certain amount of consistency in form and content) but much more difficult when the content is extracted from a pre-scientific catalogue, letter, or a traveller's diary: the kind of documents that comprise the content of the Digital Ark. The solution of The Museum Project for Norwegian Universities was to create a new XML schema (in the form of a DTD, as was the practice at that time) for marking-up these records. Because the Digital Ark is concerned with pre-modern historical documents, it seems important to use the TEI as far as it will go for describing a document itself and its content as it relates to objects, and then to look for solutions to deal with what remains to be recorded about the objects and their relationships.
Further, there is tension between the strictness required of standardised data and the need for some flexibility in describing a wide variety of objects represented in a wide variety of document types in a form that would make sense to scholars. Once it was known what a scholar of this material would see (and therefore might be looking for) in this material, it would then be possible to square the scholars' practices with currently available metadata standards, first the TEI for describing the documents and their references to people, places, and bibliographic objects, and then the standards available in the world of museum studies for describing the material properties of the objects and events involving them. At points where the shape of the material could not be traced using established standards, complementary and supplementary solutions would be sought. This is a slightly different approach than that described by Bart in dealing with her rogue manuscript, which she describes as
a combination of the rigorous application of PPEA and TEI standards wherever possible, with the shameless application of technical duct tape whenever the project hit upon a sort of data or analytical juncture that could not be handled, or not at all well handled, by current standards and available software. (Bart 2006 § 9)
Rather than stop leaks as they occurred, my intention was to assess the unique needs of the Digital Ark, as best I could, before fully committing to a particular implementation.
Conceiving the process
The investigative tagging process was in some aspects adapted from qualitative tagging used in the social sciences for marking-up interviews (more formally called Computer Assisted/Aided Qualitative Data Analysis, or CAQDAS). It also resembles the implementation of folksonomy in Steve Museum (www.steve.museum), but with more analytic capabilities. Qualitative tagging software would probably have been the most robust option for collecting and processing results, but it presented a significant barrier to access for our participants, who were spread across Europe and North America, would not have easy access to any software we might select, and would then have to learn that software. A simpler solution was required, but one with data-collecting and processing capability, which Steve Museum as a platform does not allow. A more viable solution was Glyphicus, described above. This program has the virtue of enabling ad hoc tagging of documents without some of the problems associated with XML mark-up and overlapping hierarchies because it uses a relational database as its back-end. Its chief limitations are that it requires some set-up and instruction for a new user, and it is not Web-based, so it would have to be installed locally. These were seen as barriers to involvement for some desired participants. A second option was to ask participants to use an XML editor to create their own elements on-the-fly. This option would require that the participant be familiar with XML and have access to an editor. Again, this would have been a barrier to participation. The most flexible and easily-accessible option, in the end, proved to be a very old technology. Jeff Smith, who was then the interim director of the Humanities and Fine Arts Digital Research Centre at the University of Saskatchewan, suggested using a "paper trial," that is, a pen-and-paper means for users to access the material and record their results.
In the end, we gave participants the option of reporting the results on a paper print-out or in an Excel spreadsheet.
Our desired participants were of two types: scholars working in a field associated with early modern material culture, and scholars working in the digital humanities. The former presumably would understand the documentary forms and content of the project, while the latter would be accustomed to thinking about documents in terms of structured data. In many cases, our participants belonged to both categories, giving them a valuable combination of expertise with special insight into the intellectual and technical considerations in seeking a computational solution for this project's material; but equally valuable were early modern specialists with little or no special facility with digital tools, because in the end, the published results must be useful and easily accessible to any interested scholar. Twelve participants were recruited, all of them scholars with relevant expertise. Seven were scholars already familiar with seventeenth-century collections of curiosities. Eight (including six of the previous seven) were early modernists specialising in some aspect of material culture (Figure 3). Three of those with some expertise in early modern studies also had digital humanities expertise, and another three were digital humanities specialists with no expertise in early modern studies (Figure 4).
The participants were presented with a PDF document containing a 440-word passage from a catalogue of a seventeenth-century collection of curiosities, Ralph Thoresby's Musaeum Thoresbyanum, with each of the significant content words numbered 1-271 (Figure 5) and a spreadsheet (either paper or electronic) for recording the results of their exercise. Although Thoresby's catalogue was not published until 1715, as part of his Ducatus Leodiensis, the bulk of his collection was formed in the late seventeenth century.
The directions for the participants were as follows:
The purpose of this trial is to determine the kinds of information a user of a database of collections of curiosities might be interested in. That is, I want to investigate what sorts of information a user would want me to capture for retrieval and analysis. This work will therefore influence the design of my final database. Here are your directions.
First, read the whole document to become acquainted with this material. It comes from a seventeenth-century catalogue of curiosities. Your task is to identify kinds of information that seem significant or interesting to you and to give each bit of information a name or category: think of this in terms of metadata, similar to the kind of information you would find in a bibliographic database (e.g. title, author, place of publication, etc.), except that the names or categories that you use will potentially be quite different for this material and can be whatever you want them to be. Once you have located a string of text that you would like to name or categorise, record the beginning number and end number of that string in the first two columns of the spreadsheet. Then give that bit of information a name (i.e. "tag") or, if you like, more than one name (five columns are provided, but you can add more if you feel you need to). Feel free to use any name more than once: repetition is good. The more strings of text you name, the better. Please note that these strings can be of any length (from a single word to several dozen), and these strings may overlap with other strings. A few samples are provided in the spreadsheet. Please limit yourself to one hour for this task.
Participants were given the option of hand-completing printouts of the spreadsheet as provided in the PDF, or keyboarding the results directly into a provided Excel spreadsheet (Figure 6). Thankfully, most participants chose the latter. The completed spreadsheets were compiled into a master spreadsheet and the hand-completed results were also entered.
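The reporting format described in the directions (a start number, an end number, and one or more tags per string) can be pictured as a small data structure. The following is a minimal sketch only, not the procedure used in the study: the names TagRecord and parse_row are hypothetical, the first sample row is the one quoted later in this paper, and the second row is invented purely to show an overlap.

```python
from dataclasses import dataclass, field

MAX_WORD = 271  # the content words of the passage were numbered 1-271


@dataclass
class TagRecord:
    """One reported string of text: a numbered word range plus its tags."""
    start: int
    end: int
    tags: list = field(default_factory=list)

    def overlaps(self, other):
        # Overlapping strings were explicitly allowed by the directions.
        return self.start <= other.end and other.start <= self.end


def parse_row(row):
    """Turn one spreadsheet row (start, end, tag, tag, ...) into a TagRecord."""
    start, end = int(row[0]), int(row[1])
    if not (1 <= start <= end <= MAX_WORD):
        raise ValueError(f"word range {start}-{end} is outside 1-{MAX_WORD}")
    # Keep only non-empty tag cells; participants could use any number of them.
    tags = [cell.strip() for cell in row[2:] if cell and cell.strip()]
    return TagRecord(start, end, tags)


# The first row is quoted in this paper; the second is a hypothetical example.
quoted = parse_row(["76", "86", "Description of object", "Calculus", "Human"])
invented = parse_row(["80", "90", "Provenance", ""])
```

Storing each string as an independent range, rather than as nested mark-up, is what lets two participants (or one participant) tag overlapping stretches of the same passage without conflict.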
After putting a few participants through the exercise I modified the process somewhat. I shortened the text (the first version had several entries from two collections totalling almost 1,000 words) and stipulated the timeframe as "one hour" rather than "little more than an hour." I also provided a few examples on the provided reporting sheet. These alterations were to reduce the burden of the exercise but also to improve the results. Reducing the length of the passage would reduce the amount of time participants would spend in reading and understanding the material and maximise the time spent in interpreting and reporting results. It would also increase the likelihood of participants focusing on the same parts of the passage and thus provide more overlapping results.
Preparation of files for word analysis
The next question was how to analyse the results. In keeping with our low-tech approach, we elected to produce a simple list of results and use TAPoR tools (http://tapor.ca/; and later, Voyant, http://voyant-tools.org/) to produce statistics of recurrence. The idea was that tracking recurrence would be an easy way to identify commonalities in the data structures created by the participants, and thus the relative importance of the reported categories. After compiling the spreadsheets from the participants, I stripped the table structure to produce word lists, using WordPerfect to search and replace residual encoding from the conversion from Excel to plain text. I then regularised and corrected spelling and removed punctuation and function words, so that, for example, "Description of object" became "object description" and "Scotland (? Yorke)" became "Scotland York." I did not in this or any case correct errors of fact (i.e. that York is not in Scotland, as this instance seems to suggest).
I produced two word lists: one that turned phrases into tokens by replacing spaces between words with "%," and another that retained spaces, breaking phrases into individual words. Thus in the first instance, "person name" became "person%name" and in the second it became "person" and "name." Because the objective was to identify commonalities, the latter list seemed the more promising because it was likely to produce the greater number of commonalities. Focusing now on the list of individual words, I also made some discretionary adjustments in cases where separation of words would be semantically misleading (e.g. "Old Testament" became one word, "OldTestament") or create redundancy: "record/entry" became just "record." Also in the interest of identifying commonalities, singulars and plurals were reconciled, so "bones" (seven occurrences) was merged with "bone" (five occurrences), giving twelve occurrences of "bone" in total.
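The two word lists can be illustrated with a short sketch. This is an assumption-laden approximation rather than the procedure actually followed, which relied on WordPerfect and manual, discretionary regularisation: the function-word list and the reconciliation tables below are small hypothetical stand-ins, tags are lowercased for simplicity, and the sketch does not reorder words (so "Description of object" yields "description" and "object" rather than "object description").

```python
import re
from collections import Counter

# Illustrative stand-ins for the hand-made regularisation decisions.
FUNCTION_WORDS = {"of", "the", "a", "an", "and", "or", "in", "to"}
SINGULAR = {"bones": "bone"}                  # plural -> singular
FUSED = {"old testament": "oldtestament"}     # phrases kept as one word


def normalise(tag):
    """Lowercase a tag, strip punctuation, drop function words, reconcile plurals."""
    text = tag.lower()
    for phrase, fused in FUSED.items():
        text = text.replace(phrase, fused)
    words = re.findall(r"[a-z]+", text)
    return [SINGULAR.get(w, w) for w in words if w not in FUNCTION_WORDS]


def phrase_token(tag):
    """First list: the whole phrase as one token, joined with '%'."""
    return "%".join(normalise(tag))


def count_terms(tags):
    """Return (phrase counts, individual word counts) for a list of tags."""
    phrases = Counter(phrase_token(t) for t in tags if normalise(t))
    words = Counter(w for t in tags for w in normalise(t))
    return phrases, words


sample = ["person name", "Person name", "place name", "bones", "bone"]
phrase_counts, word_counts = count_terms(sample)
```

On this sample the word list surfaces "name" as a commonality shared by "person name" and "place name," which the phrase list keeps apart; this is the reason the list of individual words was likely to produce the greater number of commonalities.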
Results received from participants
Generally, the participants had little difficulty understanding the process, with a few exceptions. One participant did not understand the idea of using word numbers to identify the start and end of a string of text; two participants did not take a general view of the data, and instead focused exclusively on content relevant to their special research interests, resulting in idiosyncratic categories. For the most part, though, the mechanism's process and intent, as presented, were easily understood.
The quality of the results themselves varied. My expectation for this exercise was that it would produce useful metadata that would provide a representation of the shape of the data. I expected the user to devise named categories for describing strings of text, analogous to XML, to produce abstractions from the data. Some participants interpreted this intent and completed the exercise accordingly. Most of them produced a mixture of category-names and simple reporting, i.e. a repeating of a term as it was found in the text: for example, one simply reported "Yorke" (word 11 in the text) rather than identifying a category, such as "place"; another reported "deer" rather than providing a category, such as "mammal." In some cases, a single string was given a mixture of categories and reporting:
| Start word | End word | Tag | Tag | Tag |
| 76 | 86 | Description of object | Calculus | Human |
In most cases these "reporting" results were idiosyncratic and low frequency (with the exception of "horn," which occurred nine times), and thus did not figure significantly in my analysis.
The results also varied in the fullness of their reporting, from eighteen identified strings of text with forty categories attached, to thirty-seven identified strings of text with 107 categories attached. Some strings were given only one category, and others were given several (up to seven). Some of these categories were structured:
And others were unstructured:
Generally speaking, though, the high frequency results provided good metadata that enabled meaningful analysis of trends in the way participants viewed the data and, more importantly, insight into how the content represented in this sample text might square with available metadata standards.
Many of the results at the higher end of the frequency list map quite nicely (with, admittedly, some manipulation) onto elements and attributes already available in the TEI: for example, <persName>, <roleName>, <placeName>, <date>. Figure 7 groups these terms under potentially relevant TEI elements, indicated in the orange columns. The remaining terms, in the left-most column, are the high-frequency values that cannot be accommodated by the TEI.
Some of these compliant categories and values (in orange) would require a liberal interpretation of the TEI guidelines: for example, to represent descriptive and qualitative values such as "short" or "thick" using the <measure> element. Another element, <event>, poses some challenges for the TEI in the context of the Digital Ark, given the complex relationships involved in the exchange of objects in the early modern period, but it is conceptually possible to devise means for representing these relationships that would be compatible with both the TEI <event> class and the CIDOC CRM provisions for representing events (On the use of XML in coordination with CIDOC CRM see Jordal et al. 2004). There are two chief limitations in the TEI for my purposes: the lack of distinction between events, acts, and activities; and the challenge of making complex and semantically defined linkages between people, places, and objects associated with any given act. Other data can be handled using available attributes: instances of a foreign "language" can be represented using @xml:lang; and @quantity could be applied to an element defining an object.
Of the remaining high-frequency terms (those in the left-most column of Figure 7), most of the categories relate directly to the objects themselves and could conceivably be accommodated by an <object> module. A major structural feature of the material in the Digital Ark is the object as a unit of content, either as a reference to, a naming of, or a depiction of some collected item. At present there is no such module, but a proposal to the TEI from the Ontologies Special Interest Group for representation of material objects is under consideration.
Another solution for identifying objects, particularly in early modern catalogues, is to use the text-structure elements of the TEI. For example, formal catalogues of the period often correlate book structure with an ontology of objects:
PART. I. Of Animals.
Sect. 1. Of Humane Rarities.
Sect. 2. Of Quadrupeds.
Sect. 3. Of Serpents.
Sect. 4. Of Birds.
Sect. 5. Of Fishes.
Sect. 6. Of Shells.
Sect. 7. Of Insects.
PART. II. Of Plants.
Sect. 1. Of Trees.
Sect. 2. Of Shrubs and Arborescent Plants.
Sect. 3. Of Herbs.
Sect. 4. Of Mosses, Mushrooms, &c. Together with some Appendents to Plants.
Sect. 5. Of Sea Plants.
PART. III. Of Minerals.
Sect. 1. Of Stones.
Sect. 2. Of Metalls.
Sect. 3. Of Mineral Principles.
PART. IV. Of Artificial Matters.
Sect. 1. Of things relating to Chymistry, and to other Parts of Natural Philosophy.
Sect. 2. Of things relating to Mathematicks; and some Mechanicks.
Sect. 3. Chiefly, of Mechanicks.
Sect. 4. Of Coyns, and other matters relating to Antiquity.
Appendix. Of some Plants, and other Particulars.
Index. Of some Medicines.
List. Of those who have contributed to this Musaeum.
(from Nehemiah Grew, Musaeum Regalis (1681), taking only the top two levels of the book divisions.)
The structure of this document implies a fairly standard high-level ontology of the early modern cabinet of curiosities, with two top-level categories: natural and artificial (with a fuzzy third, hybrid category for objects that are, in some respect, both natural and artificial). Artefacts can be further broken into classes, although there is not much agreement about these categories. Natural objects, though, typically fall out as Grew represents them: animals; plants; minerals. In the case of Grew's catalogue, the two levels of <div> elements (parts and sections) with @type values corresponding to the main content of each would go a long way to representing the model in Figure 8. In the absence of such explicit document structure, the common attributes of @type and @subtype applied to an <object> element would be able to provide the top two levels of this ontology. Further distinctions, though, begin to push against the limits of a typical TEI treatment.
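The two-level <div> structure described here can be sketched by generating nested elements from the parts and sections of Grew's catalogue. This is only an illustration under stated assumptions: the @type values are hypothetical abbreviations of the headings, not values fixed by the TEI or adopted by the project, and the TEI namespace and all surrounding document structure are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Parts and sections from Grew's table of contents; the @type tokens are
# hypothetical abbreviations chosen for this sketch.
GREW_CONTENTS = {
    "animals": ["humaneRarities", "quadrupeds", "serpents", "birds",
                "fishes", "shells", "insects"],
    "plants": ["trees", "shrubs", "herbs", "mosses", "seaPlants"],
    "minerals": ["stones", "metalls", "mineralPrinciples"],
    "artificial": ["chymistry", "mathematicks", "mechanicks", "coyns"],
}


def build_body(contents):
    """Build a <body> with one <div> per part and a nested <div> per section."""
    body = ET.Element("body")
    for part, sections in contents.items():
        part_div = ET.SubElement(body, "div", {"type": part})
        for section in sections:
            ET.SubElement(part_div, "div", {"type": section})
    return body


body = build_body(GREW_CONTENTS)
xml_text = ET.tostring(body, encoding="unicode")
```

An object entry placed inside the "birds" section would then inherit its position in the ontology from its two ancestor <div> elements, which is the sense in which book structure stands in for object classification.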
Of the few values that remain unaccounted for in Figure 7, some could be handled at a structural level, with <div> ("collection") or, potentially, in relation to an <object> element ("description"). Other categories require non-standard solutions. Information pertaining to the material qualities of objects (colour, texture, and materials) would almost certainly fall outside the scope of a TEI treatment of the <object> class and is in fact not adequately addressed in any single museum standard. CIDOC's CRM contains no provision for representing the particular qualities of objects, and the SPECTRUM schema (the U.K. Museum Documentation Standard) has some specified categories for object description (colour, material(s) of composition, and a catch-all general physical description) but has no provision for subjective categories such as "texture." Based on the results of this exercise, the potential qualities users might find in an object description will be too diverse and varied to categorise in any structured, granular manner. Other categories are even more difficult to handle, such as statements on the "curiosity" or "rarity" of an object. "Provenance" might function similarly to an <event> (i.e. a representation of some aspect of the act of exchange), although an account of an object's provenance isn't usually limited to a single event. Even more challenging are the more idiosyncratic results: the descriptive vocabularies represented in the bottom 80% of the list are so diverse and disparate as to call into doubt the viability of prescriptive encoding beyond the basic structure already available in the TEI and potentially available in the new <object> class. It is clear that potential users might have some very particular and specialised interests in material of the type that is contained in the Digital Ark and that no structured system will be able to accommodate these disparate interests. A more ad hoc solution would be required for such cases.
Implications for Digital Ark
The most important implication for my project arising out of this exercise is that there is indeed a general shape to the data that is recognisable to most people with some domain expertise, and much of it can be represented with some adaptation of current standards. It also appears that the TEI expanded to include an <object> class will be adequate for marking the essential elements in my data that cannot already be accommodated in a TEI-based schema. There is, however, also some user interest in specialised content related to early modern collections of curiosities that will not be so easy to accommodate with an existing (though modified) standard. The obvious solution here is a user-generated feature that would enable users to contribute to a database-supported folksonomy or semi-structured tagging in accordance (as much as possible) with established museum standards (On the need for ad hoc categorisation, notwithstanding the ideal of categorisation standards, see Bowker and Star 1999, 11). This folksonomy, too, could be structured to be compatible with the TEI "feature structure" (Text Encoding Initiative 2014. Many thanks to Martin Holmes, who alerted me to this little-known feature of the TEI and its potential usefulness for this data model). Similarly, representations of events using <event> will require a way to link people, places, objects, and dates in a complex way that would be very cumbersome using XML. Again, the likely solution is a database to articulate these relationships. The precise implementation of these solutions, however, is a step beyond the "investigative tagging" process and into considerations of linked data, considerations that are beyond the scope of the present paper.
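To make concrete what a database that articulates these relationships might involve, the following is a purely hypothetical sketch and not the Digital Ark's data model, which this paper leaves open. It assumes three invented tables (entity, event, participation) in which each link between an event and a person, place, or object carries a named role; all labels and the date are placeholders, not historical data.

```python
import sqlite3

# Hypothetical schema: every name here is an assumption made for illustration.
SCHEMA = """
CREATE TABLE entity (
    id    INTEGER PRIMARY KEY,
    kind  TEXT NOT NULL CHECK (kind IN ('person', 'place', 'object')),
    label TEXT NOT NULL
);
CREATE TABLE event (
    id   INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,
    date TEXT
);
CREATE TABLE participation (
    event_id  INTEGER NOT NULL REFERENCES event(id),
    entity_id INTEGER NOT NULL REFERENCES entity(id),
    role      TEXT NOT NULL,
    PRIMARY KEY (event_id, entity_id, role)
);
"""


def add_entity(conn, kind, label):
    """Insert a person, place, or object and return its id."""
    cur = conn.execute("INSERT INTO entity (kind, label) VALUES (?, ?)", (kind, label))
    return cur.lastrowid


def build_demo():
    """Create an in-memory database holding one placeholder gift event."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    event_id = conn.execute(
        "INSERT INTO event (kind, date) VALUES (?, ?)", ("gift", "1690")
    ).lastrowid
    links = [
        ("donor", add_entity(conn, "person", "Example donor")),
        ("recipient", add_entity(conn, "person", "Example collector")),
        ("object", add_entity(conn, "object", "Example horn")),
        ("place", add_entity(conn, "place", "Example town")),
    ]
    for role, entity_id in links:
        conn.execute(
            "INSERT INTO participation (event_id, entity_id, role) VALUES (?, ?, ?)",
            (event_id, entity_id, role),
        )
    return conn, event_id


def roles_in_event(conn, event_id):
    """List (role, label) pairs for everything linked to one event."""
    query = (
        "SELECT p.role, e.label FROM participation p "
        "JOIN entity e ON e.id = p.entity_id "
        "WHERE p.event_id = ? ORDER BY p.role"
    )
    return conn.execute(query, (event_id,)).fetchall()


conn, gift_id = build_demo()
gift_roles = roles_in_event(conn, gift_id)
```

Because the role is stored on the link rather than on the person or object, the same individual can be a donor in one exchange and a recipient in another, which is the kind of semantically defined linkage that is cumbersome to express in nested XML.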
Conclusion about the usefulness of the study
At first blush, because the results of this exercise mapped quite well onto existing and available metadata standards, one might wonder about the value of this process of investigative tagging. It is nonetheless valuable, for a few reasons. First, even if it does not reveal unanticipated data structures (and in other cases it might), it still confirms the suitability of the available standards and validates the data standard(s) that a project consequently chooses to adopt. Second, even though the lion's share of the data was regular and predictable, some outlier responses were surprising and called for some sort of solution. Third, following from the second, this process asserted early in the project an orientation toward the imagined user (in the first instance, a scholar of early modern material culture, but potentially anyone with an interest in such material). Although the needs of every potential user cannot be anticipated, it is valuable to consider solutions that might enable users with special interests in the data to contribute metadata that speak to those interests. Fourth, and related to the third, this process puts process before product: it forces a project to lay bare any assumptions about the data and the final form they will take, and to think deliberately about the nature of the content in order to construct an appropriate approach to the material based on what is "found" there. That is, it guards against the fallacy of prescience.
Observations about setting up this study
Finally, here are a few thoughts about the process of preparing and articulating the investigative tagging exercise.
- In providing instructions and materials to participants, it is very difficult to strike a balance between, on the one hand, being non-directive enough to allow the subjects to see what they would naturally see in the document and to record it in a way that makes sense to them and, on the other, giving enough guidance so that they understand what is required of them and are able to perform the task with some sophistication and, ultimately, produce intelligible and useful results. Too much emphasis on the idea that subjects should tag what interests them led in some cases to some very domain-centred results: e.g. the historian of medicine who tagged only elements related to human anatomy. Given this tension between directing participants enough that they complete the study correctly and not so much that their approach to the data is prescribed and what they see is constrained, it is better to err on the side of too much direction.
- Domain expertise significantly affects the ability and inclination of the subject to interpret the task and imagine how to think about the material. The best participants had either digital humanities expertise or special expertise in cabinets of curiosities. Specialists in other domains of the early modern period tended not to take into account the encyclopaedic scope of the content and tended to miss the complexity of the structures implied in the material. Digital humanists and musaeological specialists tended to do better at abstracting higher-level concepts and at thinking about the data in a structured way. Among domain specialists (i.e. those with a historical interest in the material) there is a greater tendency to focus on content rather than data structure and also to return idiosyncratic results. This is where one class of participant might have required more direction (to think in general terms), whereas digital humanities specialists required less direction because they are accustomed to thinking about information systems as distinct from content. Thus, it is difficult to find the right kind of test subject for such an exercise, or at least find the right mix of expertise across participants, and the right language to accommodate their professional tendencies in approaching the material.
- It is also difficult to compose a test document that is short enough to allow the test subjects to process all or most of it in a reasonable amount of time (in my case, one hour) yet diverse enough in the types of material and types of information so as to be somewhat representative of the diversity of the whole corpus. That is, there is tension between the amount of time one can reasonably expect a test subject to spend on the exercise, and the need for representativeness in the textual material and thoroughness in their processing of it. Less rather than more content is better, but this poses a challenge in finding a passage that has adequate diversity of content, in this case both natural and artificial objects, and some diversity of species and artefacts, names of people and places, and richness in object descriptions.
What this paper presents is a conservative form of speculative computing, one that involves the interpretive subject in open-ended exploration of the material, but one that ultimately aims to come to a fixed understanding of the structure and nature of the data under scrutiny. In common digital humanities practice, it is now expected that purpose-built interfaces for accessing scholarly materials will undergo a usability study after the fact. This study was undertaken with the intent of exploring the desirability of a pre-project assessment on the basis that some aspects of a project's ground-up conception will pre-determine the final product. While projects such as the one described here should conform as much as possible to established standards, they should also demonstrate awareness of the constructed and provisional nature of these standards. At the same time, an important outcome of investigative tagging is to lay bare the often invisible process of devising a classification system for mark-up of documentary materials.
Acknowledgements
This research was enabled by a grant from the Social Sciences and Humanities Research Council of Canada.
 The CIDOC Conceptual Reference Model (CRM) provides definitions and a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation.
Bacon, Francis. 1620. Novum Organum. London.
Bart, Patricia R. 2006. "Experimental Markup in a TEI-Conformant Setting." Digital Medievalist 1. 74 paragraphs.
Bowker, Geoffrey C. 1994. "Information Mythology and Infrastructure." In Information Acumen: The Understanding and Use of Knowledge in Modern Business, edited by Lisa Bud-Frierman, 231–47. London: Routledge.
Bowker, Geoffrey C., and Susan Leigh Star. 1999. Sorting Things Out: Classification and Its Consequences. Cambridge, MA: MIT Press.
Burke, Peter. 2000. A Social History of Knowledge: From Gutenberg to Diderot: Based on the First Series of Vonhoff Lectures Given at the University of Groningen (Netherlands). Malden, MA: Polity Press.
Chandrasekaran, B., John R. Josephson, and V. Richard Benjamins. 1999. "What Are Ontologies, and Why Do We Need Them?" IEEE Intelligent Systems. http://www.csee.umbc.edu/courses/771/papers/chandrasekaranetal99.pdf
CIDOC. 2013. "The CIDOC Conceptual Reference Model." CIDOC-CRM. http://www.cidoc-crm.org/.
Daston, Lorraine, and Katharine Park. 1998. Wonders and the Order of Nature, 1150-1750. New York: Zone Books.
Drucker, Johanna (and Bethany Nowviskie). 2004. "Speculative Computing: Aesthetic Provocations in Humanities Computing." In A Companion to Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth. Oxford: Blackwell. http://www.digitalhumanities.org/companion/view?docId=blackwell/9781405103213/9781405103213.xml&chunk.id=ss1-4-10&toc.depth=1&toc.id=ss1-4-10&brand=default
Text Encoding Initiative. 2014. "Feature Structures." P5: Guidelines for Electronic Text Encoding and Interchange. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/FS.html.
Glyphicus: Exploratory Markup Text Editor. https://launchpad.net/glyphicus.
Grew, Nehemiah. 1681. Musaeum Regalis. London.
Impey, O. R., and Arthur MacGregor, eds. 1985. The Origins of Museums: The Cabinet of Curiosities in Sixteenth- and Seventeenth-century Europe. Oxford: Clarendon.
Jordal, Ellen et al. 2004. "From XML-tagged Acquisition Catalogues to an Event-based Relational Database." In Franco Niccolucci and Sorin Hermon, eds, "Beyond the Artefact – Digital Interpretation of the Past." Proceedings of CAA2004. Prato, 13-17 April. Archaeolingua, Budapest.
Lampland, Martha, and Susan Leigh Star, eds. 2009. Standards and Their Stories: How Quantifying, Classifying, and Formalizing Practices Shape Everyday Life. Ithaca: Cornell University Press.
McCarty, Willard. 2005. Humanities Computing. Houndmills, Engl.: Palgrave Macmillan.
Moncond'huy, Dominique, et al., ed. 2013. La licorne et le bézoard: une histoire des cabinets de curiosités. Poitiers: Musée Sainte-Croix and Espace Mendès France.
Pomian, Krzysztof. 1990. Collectors and Curiosities: Paris and Venice 1500-1800. Cambridge, U.K.: Polity Press.
SPECTRUM 4.0. 2011. Alex Dawson and Susanna Hillhouse, eds. Collections Trust. http://www.collectionslink.org.uk/spectrum-standard.
Thoresby, Ralph. 1715. Ducatus Leodiensis. London.
Van House, Nancy A. 2003. "Science and Technology Studies and Information Studies." In Annual Review of Information Science and Technology 2004, edited by Blaise Cronin, 3–86. Medford, New Jersey: Information Today.
Worm, Olaus. 1655. Museum Wormianum. Leiden.
Zytaruk, Maria. 2011. "Cabinets of Curiosities and the Organization of Knowledge." University of Toronto Quarterly 80.1 (Winter): 1-23.