"Digital documents last forever—or five years, whichever comes first"
- Jeff Rothenberg (1995, 42)
The transition from print to digital poses a potential challenge for the preservation of and access to the materials that will be required by future historians of the book. Much of what could be assumed in the age of print will no longer pertain. In 2010, when I started to do research on consumer multimedia CD-ROMs from the 1990s, a decisive decade in digital (or "electronic," as it was then called) publishing in Germany, I encountered astonishingly little scholarly reflection on this issue (see Bläsi 2012). In particular, I was confronted with a very concrete primary research problem, namely that I could not get access to landmark products of the period. LexiROM, a German multi-work encyclopedia on CD-ROM first published in 1995, could not be made to run on my contemporary PC (LexiROM was a joint product of Microsoft and Germany's BI-Brockhaus and Langenscheidt publishers; see the German-language website http://de.wikipedia.org/wiki/LexiROM). The same was true of a similar American product, Microsoft's Encarta 96, and indeed of all other digital publications of the period (see, for example, the Microsoft Encarta Encyclopedia). I had to rely on an institution to grant me access to these products. Using the collection as well as the infrastructure of the German National Library (DNB) in Frankfurt, an inspection of LexiROM works like this: the patron requests the material for loan (in this case, a CD-ROM) in advance; if the responsible librarians confirm the availability of both the material for loan and the software and hardware to run it (there is no guarantee), the patron can—typically on the following day—use the product within a suitable hardware and software environment provided by the DNB in its multimedia reading room. This gives us a picture of what, in a best-case scenario, we might have to expect if we want to do research on today's digital book production in, say, 2030.
In this chapter, I first position my efforts in the context of book history before introducing basic concepts—migration and emulation—as well as challenges to the long-term preservation of digital data in general. Taking the current context of research on past books as an analogy, I shall try to outline which kinds of digital material will be necessary in the future to do research on today's digital book production. Finally, I shall present the results of my exemplary case studies of the precautions for the future availability of digital material that have been taken by various institutions involved in the production of digital forms of books (for example, publishers) as well as by public institutions (for example, libraries) whose mandate is to collect and secure future access to current books and unique artefacts, and I shall consider what this could mean for future research.
This chapter is not meant to be a comparative survey of measures proposed or implemented in different countries for the long-term preservation of certain digital materials intentionally or unintentionally originating from the processes of contemporary book production. Nor is it concerned with the more fundamental issue of data degradation in preserving digital materials. It is rather an attempt to relate current thinking about long-term preservation to the special requirements of future research on today's digital books, in order to identify actual and potential problem areas and to outline our prospects for the future. Although I take Germany as my primary context, the same issues are relevant to other regional contexts as well.
There is an ample body of literature on the nature and delimitations of book history as a field of academic pursuit. Leslie Howsam (2006) summarizes this discussion and develops her own integrative view. According to this view, the study of the history and culture of the book can be divided into literary studies, bibliography, and history. Howsam says that these "are by no means exclusive; other disciplines (such as communication studies, geography, political science, and sociology) and interdisciplinarities (cultural, classical, and medieval studies, women's studies, American, British, Canadian, and a whole alphabet of 'area studies') may be mapped onto these three core approaches" (Howsam 2006, 72). Accordingly, a book is "simultaneously a written text, a material object, and a cultural transaction" (Howsam 2006, ix). Although Howsam's thoughts do not explicitly tackle matters of the digital book and the communication space around the book, she asserts that "[t]he book is not limited to print (it includes manuscripts and other written forms), or to the codex format (periodicals and electronic texts come under examination, as do scrolls and book rolls), or to material or literary culture," and states as one of her central conclusions "that from all three disciplinary perspectives the lesson of studies in book and print culture is that texts change, books are mutable, and readers make of books what they need" (Howsam 2006, x). In book history (more than in most flavours of literary studies), not only the "product," but also the production and distribution processes, as well as the communication space around the medium, are all the subject of the scholarly endeavour.
For our purposes, the object of study has to be defined in such a way as to cover all relevant medial manifestations, including digital ones. Against the background of "the book" as we would ordinarily use the term, we shall understand books as artefacts or immaterial phenomena for the transport of predominantly textual information (possibly complemented by images and other entities from different sign systems) with specific features (length, argumentative complexity, aesthetic "pretension"). Short of an operationalized definition, Howsam's aphorism, based on E. P. Thompson's formulation referring to class, does the job: "the book is not so much a category as a process: books happen; they happen to people who read, reproduce, disseminate, and compose them; and they happen to be significant" (Howsam 2006, 5). It can be said of these artefacts that they fulfill certain functions, which can be seen as easing the burden on individual or collective memory and encompassing public accessibility; from another perspective, the functions can be seen as entertainment, education, information, and edification (Rautenberg and Wetzel 2001). Digital books (the subset in focus here) are digital files or applications that convey content that is also distributed as printed books, or the content of which has sufficiently close links (structure, players involved, brand) to a content type that is or has traditionally been distributed as printed books. A wide range of very different forms has evolved in the young history of digital books, from very early computer manuals on mainframe computer systems, read from green-gleaming terminals in the early decades of computing, through the book-based multimedia CD-ROMs of the 1990s taken as realia and motivation for this chapter, all the way to current forms. Current digital books manifest themselves in forms ranging from book-like HTML content on websites and PDFs to view or download, all the way to ebooks in the narrow sense of the word, i.e. digital book-like content in specialized formats (proprietary ones like Amazon's KF8 and open ones like EPUB) intended to be read on mobile devices, whether dedicated e-readers or universal smartphones and tablets. Ebooks can also be applications; if they contain multimedia content, they are often referred to as enhanced ebooks. Although current forms of digital books are not limited to ebooks (consider phenomena such as Wikipedia), I shall nonetheless focus on ebooks in what follows.
The considerations of this chapter can be seen as embedded in book history insofar as they dwell on the bibliography / material object strand (according to Howsam's categorization above; I prefer the more comprehensive concept of "book studies" to "book history," but for the historical perspective taken here the difference is not significant). The main objective here is to examine the "material" requirements for future scholars of the book to study today's (i.e. 2015's) digital book communication. Such studies can be described as bibliographic, since bibliography is "the discipline whose concern is with the book as a material object..., where the emphasis is on the preservation and transmission of written text" (Howsam 2006, 13). The focus of this kind of study is—following Howsam—on documents as objects, and typical research problems include the history of a single book, its publishing history, and its place in book-trade history. As hinted at briefly above, it is ironic, and seems paradoxical at first, that in the case of digital books the immateriality of the production and transmission processes, and of the product itself (the 0s and 1s, as it were), is the subject of the materiality strand of media studies. Like bibliographic work on printed books, this kind of bibliography requires knowledge and competences from other fields and domains. Just as bibliographic work on, for example, nineteenth-century printed books requires knowledge about papermaking or bookbinding, bibliographic work on digital forms of books requires a multidisciplinary approach including especially—as we shall see in more detail below—knowledge pertaining to information science. In many respects, of course, the long-term preservation of digital books is a subset of the broader task of preserving digital files of all kinds, be they administrative or accounting spreadsheets, or digital images or videos from the cultural sphere. Since, however, many insights into the production (and distribution) as well as the preservation of the strings of 1s and 0s that can be described as digital books can only be gained by considering their specific cultural and economic context, methodologies of information science have to be integrated into a further development of bibliography as a field of study.
That access to past digital books can be a problem is not yet considered particularly urgent in book studies. As time goes by, however, when for example the 1990s become ennobled as "historic" (or when book studies finally manage to escape from their somewhat awkward historical bias in research and teaching), access surely will be seen as an urgent problem. But the problem has already been made very explicit in other domains. Henry M. Gladney mentions cases from political history in which the non-availability (as a superset of the non-accessibility) of digital files has been an obstacle to historical research—and, indeed, to political-historical judgment. Gladney quotes a posting by Eduard Mark in a Michigan State University discussion group; Mark argues, "It will be impossible to write the history of recent diplomatic and military history as we have written about World War II and the early Cold War. Too many records are gone. ... History as we have known it is dying, and with it the public accountability of government and rational public administration. ... The federal system for maintaining records has in many agencies ... collapsed utterly" (as cited in Gladney 2010, 47). Gladney notes that Mark "could not secure many basic records of the invasion [of Panama, to remove Manuel Noriega in 1989], because a number were electronic and had not been kept" (Gladney 2010, 47). Writing the history of 1990s "electronic" publishing will require access to digital documents both as digital products and as traces of the activities of human endeavour, and will therefore be similarly difficult; researching 2015's ebook output in 2030 could well be harder still.
As has been pointed out above, the long-term availability of the bit strings that make up digital books is a prerequisite for guaranteeing access to them as books in the future, for research purposes among others. The long-term preservation of digital data of course presupposes that a) the physical carrier of the digital information is maintained physically (and not lost or destroyed) and b) the information on the data-carrying medium is readable on a physical level, i.e. the 1s and 0s can be identified optically (for example, on a CD-ROM) or magnetically (for example, on a hard disk or a floppy disc). The former is taken for granted here, and the latter is a concern of long-term preservation research that shall not be tackled here. The main concern of this chapter is how certain bit strings that are accessible and readable on a technical level can be read, interpreted, and processed as digital books once the applications originally dedicated to viewing them are no longer executable on future computer systems for reasons of obsolescence.
The most obvious approach to long-term preservation is to maintain a so-called "hardware museum," that is, to keep a repository of the hardware (with the period's operating software) needed to view or execute the corresponding files in the future, once access with then-current hardware and software is no longer possible. Returning to the example given in the introduction, I could of course still view Encarta 96 today if I had kept my mid-nineties Pentium PC running Windows 95. This approach is not generalizable, however: it might work for a certain time and to a certain degree in a professional preservation context (as, for example, in a national library) but, given the pace of IT development, it certainly cannot work for the individual consumer. The variety of hardware and software technologies to be kept is wider than even a large institution can possibly handle in the long term, considering, for example, the increasing difficulty of acquiring parts for old hardware. These sobering considerations leave the long-term preservation community with two basic approaches to securing future access to digital files, namely migration and emulation.
The idea of migration is to transform digital files designed for hardware and software platforms that are becoming obsolete into file formats that can be viewed/executed on a later generation of hardware and software (a simple office-context example would have been to store a Word 95 / 7.0 document on a PC running Windows 95 as a Word 97 / 8.0 document using the transformation software provided as part of a Word 95 update). Since transformation software typically runs on the "old" hardware and software, this process has to be repeated as technology develops. Certain features of the digital file, however, might not be viewable and/or executable on the "new" hardware and software; even if such changes or restrictions are small and incremental, they might accrue to an undesirable or even unacceptable level. For our context, this is relevant, for example, in cases where the main message, the "immaterial" text, can be kept by transformation but features of its appearance (its bibliographical codes, such as fonts, layout, etc.) get lost in the process. Borghoff et al. propose a pragmatic combination of keeping the original file (for example, for emulation; see below) and a migrated version: "[I]t may be worthwhile to derive (using some migration strategy) a vernacular copy that can be rendered directly with current rendition software running on current platforms. This more 'up-to-date' version can [be] archived along with the original for easier access" (Borghoff et al. 2006, 76).
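To make the migration idea concrete in today's terms, here is a minimal sketch of one batch-migration step. It assumes LibreOffice is installed with its soffice binary on the PATH; the directory names and formats are purely illustrative, and the original files are deliberately left in place, in line with Borghoff et al.'s recommendation to keep the original alongside the migrated copy.

```python
# Minimal sketch of one migration step: batch-convert legacy word-processing
# files to a newer format with LibreOffice's headless converter.
# Assumes LibreOffice is installed and `soffice` is on the PATH; paths,
# source format, and target format are illustrative only.
import subprocess
from pathlib import Path

LEGACY_DIR = Path("archive/legacy_doc")   # e.g. old .doc files
MIGRATED_DIR = Path("archive/migrated")   # migrated copies, kept alongside originals

MIGRATED_DIR.mkdir(parents=True, exist_ok=True)

for source in LEGACY_DIR.glob("*.doc"):
    # The originals stay untouched; only a converted copy is written to --outdir.
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "docx",
         "--outdir", str(MIGRATED_DIR), str(source)],
        check=True,
    )
    print(f"migrated {source.name} -> {MIGRATED_DIR / (source.stem + '.docx')}")
```

A production workflow would of course also record what was migrated, when, and with which tool version, since that provenance is itself part of the future researcher's evidence.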
Since in many cases it does not suffice that a migrated digital file is viewable on only one of the current hardware and software platforms, but should ideally be viewable on as many platforms as possible (possibly on more than the original product was designed for), the migration should target a prevailing data format or de-facto standard. For documents, the current prevailing format is the Portable Document Format (PDF). The International Organization for Standardization has made a PDF subset, PDF/A, a standard for long-term preservation; however, PDF/A does not allow the embedding of audio and video, scripting, the referencing of outside resources, or the use of most compression algorithms (the current ISO-standardized version, specified in ISO 19005-2, is PDF/A-2). This is an unacceptable limitation for the purposes of book history.
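How PDF/A conformance is at least signalled can be illustrated with a small, assumption-laden sketch: PDF/A files must embed XMP metadata (uncompressed) containing a pdfaid:part property naming the claimed part of the standard. The snippet below merely scans for that claim; it is a heuristic, not a substitute for a full validator such as veraPDF.

```python
# Rough heuristic: check whether a PDF at least *claims* PDF/A conformance
# by scanning for the pdfaid:part property in its embedded XMP metadata.
# PDF/A requires this metadata stream to be stored uncompressed, which is
# why a naive byte scan can work at all. Not a validator.
import re

def claimed_pdfa_part(path: str) -> str | None:
    """Return the claimed PDF/A part (e.g. '2') or None if no claim is found."""
    with open(path, "rb") as f:
        data = f.read()
    # Matches both <pdfaid:part>2</pdfaid:part> and pdfaid:part="2" spellings.
    match = re.search(rb'pdfaid:part[>="\s]+(\d)', data)
    return match.group(1).decode() if match else None

print(claimed_pdfa_part("example.pdf"))  # e.g. '2' for a PDF/A-2 file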
The idea of emulation is either to transform (i.e. reprogram) the viewing software so that it will run on a new setup or, more often, to develop new viewing software that behaves like the old one insofar as it can read the file and render it much the way the old software did (a simple office-context example would have been to reprogram Word 95 / 7.0 so that it runs on a later version of Windows and can therefore display Word 95 / 7.0 documents exactly as it did on Windows 95). Another alternative is to develop and run an "application" for the new hardware and software configuration that behaves like the old operating system; this makes it possible to run the old application (i.e. the viewing software and, theoretically, all other applications that ran on the old set-up). The advantage of this particular but complex approach is that the procedure (developing an application for the new system that behaves like the operating system of the old) can be repeated recursively as soon as the system for which the emulating application-as-operating-system program was developed itself becomes obsolete: one simply develops and inserts a new program of this kind into what will eventually be a cascade of application programs emulating obsolete operating systems. This approach is called layered emulation (see Borghoff et al. 2003).
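The cascade logic of layered emulation can be pictured with a deliberately simplified sketch. All class and platform names below are invented for illustration; no real emulator is involved, only the recursive structure of the approach.

```python
# Purely conceptual sketch of layered emulation: each layer presents the
# interface of an obsolete platform to the software above it and is itself
# hosted by the (newer) layer below. All names are illustrative.
class Platform:
    """A real or emulated platform; `host` is the platform it runs on."""
    def __init__(self, name, host=None):
        self.name = name
        self.host = host  # None = actual current hardware/OS

    def run(self, application):
        trace, layer = [], self
        while layer is not None:      # walk down the emulation cascade
            trace.append(layer.name)
            layer = layer.host
        print(f"running {application!r} via: " + " -> ".join(trace))

# When the 2015 host becomes obsolete, the inner layers stay untouched:
# one new emulator for the next system simply extends the cascade.
win95 = Platform("Windows 95 emulator",
                 host=Platform("2015 host OS emulator",
                               host=Platform("2030 host OS")))
win95.run("LexiROM")
# running 'LexiROM' via: Windows 95 emulator -> 2015 host OS emulator -> 2030 host OS
```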
With emulation as a long-term preservation strategy, there is in principle nothing for an originator or responsible gatekeeper of digital books (files or applications) to do, since from the perspective of cultural heritage (rather than that of personal consumption), emulation can be seen as belonging to the realm of institutions mandated to ensure the long-term availability of the corresponding emulating software. This insight does not guarantee, however, that corresponding measures are actually in place.
As we have seen, certain features of, for example, enhanced ebooks containing scripting and/or time-based media pose considerable challenges for migration. The case is similar for emulation. David Giaretta (2011) has classified such complicating features according to a series of dichotomies: rendered vs. non-rendered; passive vs. active; static vs. dynamic. In each case, the second term of the binary presents a significantly complicating factor. According to Giaretta, digital objects can be rendered or non-rendered. Non-rendered, the complicated case, means that they require explicit further processing or interpretation—rather than simple viewing—to make sense to the user. This can be the case, for example, with files from accounting applications: a file from such a system might only make sense in context with other files on the same system, because otherwise it might not be known to what data-type a certain entry in a column of a table belongs: is "234.567" a sales number or a part identifier? Furthermore, digital files can be passive or active. If they are active (again, the complicated case), they might be applications, pieces of code that have to be executed, not just viewed. The long-term preservation of applications as an aim in itself is not the focus of this chapter, but in certain cases the content is inseparable from the program code for viewing it, so that keeping the content necessarily means keeping and supporting the application incorporating that content. Enhanced ebooks realized as book apps (applications with book content geared to run on mobile devices) are a striking example of this case. These apps are very similar to the case taken as this chapter's point of departure, LexiROM, where content was bound to proprietary viewing software. And finally, according to Giaretta, digital files can be either static (finished and locked) or dynamic, meaning that the content cannot be considered finalized at any given moment. The latter is the case for most content offered on the Web, especially via social media such as blogs. It is, however, also the case if digital books available for download can and do get corrected, improved, or otherwise changed over time.
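Giaretta's three dichotomies can be restated as a small classification record. The following sketch, with invented example objects and an invented scoring of "complicating" poles, simply makes explicit which side of each binary a given digital object occupies.

```python
# Sketch of Giaretta's dichotomies as a classification record; the example
# objects and the notion of listing "complicating factors" are illustrative.
from dataclasses import dataclass

@dataclass
class DigitalObject:
    name: str
    rendered: bool   # False = needs further processing/interpretation
    passive: bool    # False = executable code, not just viewable content
    static: bool     # False = content keeps changing over time

    def complicating_factors(self) -> list[str]:
        factors = []
        if not self.rendered: factors.append("non-rendered")
        if not self.passive:  factors.append("active")
        if not self.static:   factors.append("dynamic")
        return factors

plain_epub = DigitalObject("novel as EPUB", rendered=True, passive=True, static=True)
book_app   = DigitalObject("enhanced book app", rendered=True, passive=False, static=False)
print(plain_epub.complicating_factors())  # []
print(book_app.complicating_factors())    # ['active', 'dynamic']
```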
Having addressed the two general approaches to the long-term preservation of digital data, we must concede that a general strategic recommendation cannot be given at this stage. The central question, namely which general archiving strategy to follow, is still open. Nonetheless, there is a wealth of experience available from past long-term archiving projects in the IT industry, from which a few recommendations for migration can be deduced. First, proprietary formats, the details of which are kept secret and the development path of which is controlled by a commercial entity, should be avoided. Second, formats should be used that can be processed in an automated manner, since there might be neither time nor resources for manual interventions in a future migration process. In any case, the number of formats used should be kept to a minimum (see Borghoff et al. 2003; Borghoff et al. 2006). While it is important to push for the development and use of appropriate standards, implementation decisions are ultimately left to the corresponding institutions. There is, however, a framework put forward by the Consultative Committee for Space Data Systems (CCSDS) outlining concepts, methods, and processes for the long-term preservation of digital data. This framework, the current version of which (dated June 2012) was published as ISO standard 14721:2012 in August 2012, sees long-term preservation not only as a technical challenge, but as an area involving "legal, industrial, scientific and cultural issues" (The Consultative Committee for Space Data Systems 2012, 1-3).
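As a rough illustration of the OAIS vocabulary used by that framework, the sketch below models an Archival Information Package as a plain data structure combining the content with the four classic categories of Preservation Description Information: reference, provenance, context, and fixity. The field names, file names, and the example identifier are assumptions for illustration, not the normative model.

```python
# Sketch of an OAIS-style Archival Information Package (AIP) as plain data:
# content plus Preservation Description Information (reference, provenance,
# context, fixity). Illustrative only; field names are not normative.
import hashlib
import json
from pathlib import Path

def build_aip(content_path: str, identifier: str,
              provenance: list[str], context: str) -> dict:
    data = Path(content_path).read_bytes()
    return {
        "content": content_path,
        "pdi": {
            "reference": identifier,                     # persistent identifier
            "provenance": provenance,                    # who produced/changed it, when
            "context": context,                          # relation to other objects
            "fixity": hashlib.sha256(data).hexdigest(),  # bit-rot/tamper check value
        },
    }

aip = build_aip("ebook.epub", "urn:nbn:de:example-123",
                ["received from aggregator, 2015-03-02"],
                "EPUB edition parallel to a printed edition")
print(json.dumps(aip, indent=2))
```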
In the following, we shall see that technological concepts like migration and emulation, as well as the CCSDS framework mentioned above, do not seem to be utilized (at least not widely and thoroughly enough) in the case of ebooks and other digital-born materials of interest to humanities scholars. The reason for this might be insufficient awareness of the problem, a lack of political will, managerial ineffectiveness, or, more likely, a lack of resources (Giaretta 2011).
Since we cannot make definite statements about future approaches to books and literature, it is clear that the provision of literature-related material for research purposes has to be as theory-agnostic as possible. It can be taken for granted, however, that the content of the book itself (typically text-based, possibly enhanced by images and, in the case of digital publications, possibly also by audio, video, and interactive features) has to be available to future scholars of textual studies, and that it must therefore be preserved. This involves preserving the substrate: the medium that contains the content (the vellum of a manuscript, or the paper and binding of a published book). In the case of digital publications, the substrates can be declarative files in various formats together with the means of rendering them. The substrates of manuscript and print ensure a fixed display. The ergonomic and aesthetic appearance of the content of most digital books, however, is increasingly less fixed: typographical parameters such as font or font size can often be set by the reader, and the text might look different depending on the device and settings being used. This is not true, for example, of children's books, where an opposite trend has induced the development of ebook formats that preserve important "physical" features, like the position of line-breaks, across customer settings and devices. Developments like the former are, as Adriaan van der Weel argues earlier in this volume, already gradually moving our mindsets away from the typographical. Nonetheless, the default appearance as well as the range of parameters (font, font size, etc.) at the disposal of contemporary readers should be identifiable and reproducible for future researchers. The reason is that the design decisions underlying these product features might have contributed to the construction of meaning by contemporary readers; from the perspective of textual scholarship, we can say that a literary work might set in motion not only one, but "two large signifying codes, the linguistic code (which we tend to privilege when we study language-based arts like novels and poetry) and the bibliographical code" (McGann 1991, 56). The bibliographical code of digital books, as mentioned, is constituted by a range of formatting decisions and options.
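Whether an ebook's appearance is reflowable or fixed is, in EPUB 3, recorded as package metadata: a rendition:layout property whose value is "pre-paginated" for fixed layout and "reflowable" by default. The following sketch reads that property naively; a robust tool would locate the package file via META-INF/container.xml rather than by extension, and the file names are illustrative.

```python
# Naive sketch: report the EPUB 3 rendition:layout declaration of a package.
# 'pre-paginated' = fixed layout (e.g. many children's books); absence of the
# property means the default, reflowable layout. Illustrative, not robust.
import re
import zipfile

def layout_of(epub_path: str) -> str:
    with zipfile.ZipFile(epub_path) as z:
        # Shortcut: grab the first .opf package file found in the container.
        opf_name = next(n for n in z.namelist() if n.endswith(".opf"))
        opf = z.read(opf_name).decode("utf-8", errors="replace")
    m = re.search(r'property="rendition:layout"[^>]*>([^<]+)<', opf)
    return m.group(1).strip() if m else "reflowable (default)"

print(layout_of("picture_book.epub"))  # e.g. 'pre-paginated'
```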
Matthew G. Kirschenbaum has described in great detail a highly interesting research case concerning a piece of digital-born literature, and I shall take his account as an opportunity to illustrate the challenges with respect to the "materiality" of digital communication and to outline a bibliography of the digital (Kirschenbaum 2008). The example may serve as an argument for taking the "materiality" of digital books more seriously. Following the transmission and research history of William Gibson's digital poem "Agrippa," Kirschenbaum suggests that the "materiality" of the communication does indeed matter and can help to illuminate digital works of art. In the example explicated by Kirschenbaum, the phenomenon inspected is less an element of the bibliographical code in a narrow sense than a trace of the production and distribution history of the work of art. In this sense, the example is also connected to the section on unique artefacts below. It is nonetheless the task of a bibliography of the digital to tackle phenomena like this. The poem, on a floppy disc, was originally part of a work of plastic art produced in an undocumented number of copies in 1992. The floppy disc was said to display the poem only once, erasing it afterwards. How the poem nonetheless ended up on the Web shortly after its initial publication is part of the riddle around the poem that Kirschenbaum set out to solve. Kirschenbaum's central observation is that
[i]n the copies of "Agrippa" available on the web the poem's final lines sometimes appear like this: "laughing, in the mechanism ." The final dot is not a typo, nor is it an act of authorial punctuation. It is instead the material mark of the electronic mail software that was most likely used to transmit the ASCII version of the poem, as distinctive as the broken serif that allows a forensic document examiner to pinpoint a specific typewriter. The UNIX-based "mail" program uses a period alone on a new line to indicate the end of a file .... MindVox [a service by an early US Internet service provider of the same name] was a UNIX-based board (Kirschenbaum 2008, 231).
He concludes that this forty-sixth character of the ASCII alphabet, presenting itself a decade after the initial transmission from disc to Web, makes the mechanism of the transmission transparent. For Kirschenbaum, "Agrippa" is perhaps the best example
of the capacity of a digital object to take on and accumulate a material, indexical layer of associations, and for that reason, as much as the poetry, it continues to compel. But that capacity is unequivocally present in all digital objects, and to show how and why that matters has been the purpose of this book—to cultivate, in Gibson's words, an "awareness of the mechanism" (Kirschenbaum 2008, 231).
The case documented by Kirschenbaum clearly illustrates the importance of preserving not only the content of digital books, but also the digital equivalent of the bibliographical code that instantiates it. Alan Galey is currently doing interesting research in this area (see Galey 2015).
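The transmission artefact at the heart of Kirschenbaum's analysis rests on a convention that can be stated in a few lines of code: in SMTP and the UNIX mail tradition, a period alone on a line terminates the message body (and genuine lines beginning with a dot are "dot-stuffed" to protect them). The sketch below, using a schematic rather than authentic text sample, merely flags such a lone dot.

```python
# Small sketch of the mail convention Kirschenbaum describes: a period alone
# on a line marks the end of the message body in SMTP / UNIX "mail".
# Text sample is schematic, not the actual "Agrippa" file.
def has_mail_terminator_artifact(text: str) -> bool:
    """True if the text carries a lone '.' line, like web copies of 'Agrippa'."""
    return any(line.strip() == "." for line in text.splitlines())

poem = "laughing, in the mechanism\n.\n"
print(has_mail_terminator_artifact(poem))  # True
```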
How much of the digital traces of a product should or must be accessible in the future is less clear. A possible approach to defining this is to take an appropriate model of book communication, deduce from it the traces that were considered worth keeping in the age of the printed book, transfer those to the actualities of the digital age, and complete them, if necessary, on the basis of additional considerations. Given the product-centered view of this chapter, the author, publisher, and printer stages are of interest: these are obvious, and can also be found in accepted scholarly models, for example those of Darnton (1982) and Janello (2010). Darnton, a humanities scholar, implies that the stages of his model are universal and that they form a circuit, since the reader (one of the stages not considered in the product-centered view) can in turn become an author; Janello, a scholar of media economics and management, takes this model to argue that the system of stages is in transition from a (value) chain to a (value) network, with new players, "shortcuts," and "detours."
Guided by these stages, I consulted the archival material of a German travel guide publisher which had been conveyed to the Mainz Publishers' Archive. There I was able to identify material of the following types: readers' letters, cover designs, images, diapositives, such "objects" as restaurant visiting cards, postcards, bills, letters to and from authors and service providers (such as printers), lists of the titles on offer from the publishing house at the time, receipts, memos, file memoranda, lists of possible reviewers of publications, and letters to and from possible licensees. All this material is connected to the aforementioned author, publisher, and printer stages. This confirms that just about everything left behind by the stages of the book value chain in the processes leading to a work of art (apart from the final work itself) is a candidate for archiving. Unpublished earlier versions as well as manuscripts and edited manuscripts of the works can also be part of this material. We shall henceforth refer to these materials as "unique artefacts." Such material allows us to "break ... the spell of romantic hermeneutics by socializing the study of texts at the most radical level," as McGann puts it. "We must turn our attention to much more than the formal and linguistic features of poems or imaginative fictions. We must attend ... to ... bindings, book prices, page format, and all those textual phenomena usually regarded as (at best) peripheral" (McGann 1991, 12-13). And these aspects are often to be found in the sorts of unique artefacts listed above.
To round off the preliminary analogy with non-digital cases: researchers could and can get access to published materials in bookstores (if the work in question is still in print), in libraries of various types, and in archives. Unique artefacts are accessible in archives—that is, if the artefacts have been kept because they fell or fall within the scope of such an archive (for example, that of an author or a publishing house) and, always somewhat incidentally, actually ended up there. Notable German examples are the German Literature Archive in Marbach and the Mainz Publishers' Archive (of the Institute for Book Studies of Johannes Gutenberg University). Whether researchers succeed in gaining access to the material in particular cases depends on a whole range of conditions, but it is possible in principle.
Access to both published materials and unique artefacts is as important for digital materials as it is for paper. In Kirschenbaum's argument we see a transfer of the bibliography strand of Howsam's and McGann's accounts of the history of books to the history of digital books. This approach, says Kirschenbaum, "takes its cues not only from the digital edge, but also from fields like comparative media, bibliography, textual scholarship, and the history of the book, or 'book studies'" (Kirschenbaum 2008, xiv). And, as Kirschenbaum quotes from Jonathan Rose, "We cannot effectively address questions of literary history or interpretation ... until we know how books (not texts) have been created and reproduced, how books have been disseminated and read, how books have been preserved and destroyed." Just as scholars of earlier periods of textuality had to come to grips with all the materials, mechanisms, and means of book production and dissemination, so we, in the case of digital books, will have to "follow the bits all the way down to the metal" (as cited in Kirschenbaum 2008, xiv-xv). To make this possible, the corresponding substrates have to be accessible.
Now let us put ourselves in the position of someone in 2030, looking back on the digital literary/book production of 2015. I have identified a particular publisher, a library, and an archive as contexts for examining what we might expect to be the conditions for this imagined study of the history of digital publishing: Rowohlt, one of the leading and most renowned German trade publishers (mainly for the unique artefacts documenting the publishing process), the German National Library (for published materials), and the German Literature Archive (for unique artefacts). In these contexts I will try to identify some problem areas for a future scholarly investigation of today's book production.
The preservation of published materials faces a major obstacle in the diversity of file types, applications, and supports used to disseminate them. According to its current policy, the German National Library (DNB) only archives ebooks in PDF and EPUB formats, EPUB (in version 3) being the open, XML-based format standard proposal whose widespread adoption would be an important contribution to the effective and efficient long-term preservation of ebooks (reported in a conversation between Natascha Schumann of the German National Library and the author on June 5, 2012). It is worth noting that the range of different ebook formats has been identified as an even more immediate problem for contemporary ebook buyers and readers, as well as (from a retailing and cultural diversity perspective) for independent bricks-and-mortar bookstores (see Bläsi and Rothlauf 2013). As a consequence of this policy, the German National Library does not archive ebooks in AZW or KF8 (the formats used by Amazon), for example, or in the form of apps. This means that the ebooks distributed by the second most important ebook retailer in Germany, Amazon (which distributes in its proprietary formats only), are presently not systematically archived by the German National Library. There is, it is true, a large intersection between the ebook contents offered across different distribution channels and formats; for many Amazon ebooks there might therefore be an EPUB equivalent from another retailer which is archived according to the current business rules of the German National Library. For certain research problems, however, for example those pertaining to differences between ebook versions across distribution channels, this might not suffice, and Amazon-only ebooks fall through the cracks in any case.
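The format boundary that this policy draws can be checked mechanically: PDFs begin with the magic bytes %PDF-, and an EPUB is a ZIP container carrying a mimetype entry reading application/epub+zip. The following sketch is illustrative only and is not, of course, the DNB's actual ingest code.

```python
# Sketch of a format check along the lines of the archiving policy described
# in the text: PDFs by magic bytes, EPUBs by their ZIP container signature.
import zipfile

def identify(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(5)
    if head == b"%PDF-":
        return "PDF"
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as z:
            if ("mimetype" in z.namelist()
                    and z.read("mimetype").strip() == b"application/epub+zip"):
                return "EPUB"
    return "other (e.g. KF8/AZW or app) - outside the archived subset"

print(identify("submission.epub"))
```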
Another problem with respect to published materials is that, for digital products offered for download (rather than distributed on a physical data carrier), it is easy for publishers to change the file on offer almost without effort or cost, be it to correct, improve, or enhance it. Such undocumented changes of content, sometimes called "silent releases," are used by publishers as a handy new option: "Our book apps are modified with respect to content or technology in the course of their product lifecycle—without anyone outside noticing," admits Uwe Naumann, chief editor of the non-fiction department at Rowohlt (in a phone conversation with the author on May 11, 2012). The German National Library neither tracks nor systematically stores such updates. For certain research problems, for example those pertaining to different textual and technical variants over time, this is a problem, since undocumented changes cannot be systematically tracked and recorded.
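Detecting silent releases would not even require full digital signatures; a simple checksum registry would already disclose the identity or non-identity of files harvested at different points in time. A minimal sketch, with invented file and registry names, follows.

```python
# Minimal sketch: hash each harvested copy of a title and compare it with the
# hash recorded at an earlier harvest; a mismatch signals a "silent release".
# Registry layout, file names, and identifiers are invented for illustration.
import hashlib
import json
from pathlib import Path

REGISTRY = Path("hash_registry.json")

def sha256_of(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def check_for_silent_release(title_id: str, downloaded: str) -> bool:
    registry = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
    new_hash = sha256_of(downloaded)
    changed = title_id in registry and registry[title_id] != new_hash
    registry[title_id] = new_hash
    REGISTRY.write_text(json.dumps(registry, indent=2))
    return changed  # True = differs from the previously recorded version

if check_for_silent_release("978-3-example", "harvested_copy.epub"):
    print("silent release detected: archive this version as a new variant")
```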
Another challenge is one of access on a legal and technical level. Some ebooks are still guarded by so-called restrictive (or "hard") digital rights management (DRM) measures to protect intellectual property. Most of these measures employ rather complex procedures in the course of which the legal owner (or, more precisely, legal licensee) of the file essentially gets an individual digital key from a server on the Web to unlock the encrypted content. This key, however, only works on the licensee's current individual configuration, from which the request was sent. This means that in many cases the same content file, using the same key, cannot be viewed on another device or by another user. Since this restricts the usage options for libraries in an undue manner, the German National Library has a strict policy in this respect. DNB's Natascha Schumann says, "We do not archive products with hard DRM, since we might not be able to access these products, for example, when the license server has ceased to exist." Owing to the problems caused by the variety of ebook formats and the use of restrictive DRM measures, only ebooks in EPUB and PDF formats (in particular, no book apps) without restrictive DRM are systematically archived by the German National Library, as mentioned. This represents a more or less arbitrary subset of current German-language ebook production: an incomplete record missing not only certain formats and versions of more common ebook titles but, in some cases, ebook titles altogether.
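A first technical screening for hard DRM is possible at the container level: EPUBs protected by the common commercial schemes normally carry a META-INF/encryption.xml entry. The sketch below uses only this presence test, which is a heuristic (the same entry is also used for mere font obfuscation) and not, of course, the DNB's actual DRM check; a real accession workflow would inspect the declared encryption methods.

```python
# Heuristic sketch of a DRM screening step: flag EPUBs whose container
# declares encrypted resources. Presence of encryption.xml is only an
# indicator; it can also stem from harmless font obfuscation.
import zipfile

def possibly_hard_drm(epub_path: str) -> bool:
    with zipfile.ZipFile(epub_path) as z:
        return "META-INF/encryption.xml" in z.namelist()

print(possibly_hard_drm("submission.epub"))  # True -> inspect before ingest
```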
Whose task is it to ensure that today's ebooks can be accessed in a more distant future? For Germany, the answer in principle is the German National Library, as mandated by federal law. The situation is similar for many other national contexts, with dedicated libraries and repositories. The German law was amended in 2006 to include "incorporeal editions" of books, including those published on the Web (see http://www.dnb.de/DE/Wir/Sammelauftrag/sammelauftrag_node.html). As we have seen, the application of the law in DNB policy limits considerably the scope of what is actually included. For materials that are covered, the German National Library has established a complex, yet efficient accession process that includes a correctness check, the extraction of technical metadata (version, etc.), and a DRM check (to make sure there are no hard DRM restrictions) (as told by Schumann). For the archiving of these ebooks, the German National Library enters into contracts with the individual publishers, but for efficiency reasons it typically receives the actual files from the aggregators (intermediaries) that also supply ebook stores with ebooks from different publishers (as told by Schumann). So, for the library side of the German case, the legal framework and archiving processes are in place, though their scope of application is limited.
And the publishers? Publishers could, if only for their own purposes, keep track of their ebook production. In the case of Rowohlt, however, I can say—based on statements by Uwe Naumann—that for its ambitious "Digitalbuch plus" series of non-fiction book apps, for example, no systematic provisions whatsoever are taken. Apart from the finished products remaining on servers for download (probably for a limited time), files are treated just like the unique artefacts connected with their production (see below), i.e. they are deleted as soon as the (production) project is considered finished.
The fact that most of the current ebooks on offer are more or less unchanged editions of content that has been or is being published in parallel in printed versions, and that they will therefore be archived following the well-established processes applied to printed books, might be an aberration of this period of transition. The compromise might be considered an acceptable fallback for philologists or literary scholars (in Howsam's terminology), but certainly not for historians of the book, bibliographers, scholarly editors, or, more generally, scholars of book and media studies who are mainly or also interested in determinants of the transmission, or indeed in the multimedia enhancements of the increasing number of enhanced ebooks.
The variety of file formats, identified above as one of the basic problems for the long-term preservation of digitally published material, is much larger in the case of unique artefacts, since the scope is not a certain subset of formats, namely ebook formats, but a potentially unlimited number of data forms produced by an almost inestimable range of application programs, from word processing and personal address management all the way to accounting software. Moreover, the documents are either the result of individual data processing outside the scope of any kind of systematic information management, or the result of corporate data processing whose primary aim is typically not interoperability. There are, it is true, a few de-facto standards, but even those have changed considerably over time (consider, for example, the move from MS Word .doc to .docx).

As with published materials, there can also be applications among the unique artefacts, such as the programs written by Friedrich A. Kittler, a German media theorist (Heinz Kramski of the German Literature Archive, in a phone conversation with the author on May 22, 2012). Kittler's applications, which are among the materials now in the care of the German Literature Archive, only run on the respective configurations used by Kittler himself.

The exchange of letters between representatives of the different links in the publication chain of the traditional book industry has been an important source for extra-textual approaches to literary works for centuries, especially in the case of correspondence between author and publishing house. Its most prevalent current equivalent, the exchange of emails, poses specific additional challenges: again, there is not only a range of formats over different generations of technology, but the relevant content can, in the case of application services provided by, for example, Google or Yahoo, be stored in the "cloud," which may not provide options to export it in a systematic manner. Moreover, the correspondence is typically inseparably entangled with other discourses, possibly irrelevant or in need of protection for well-understood privacy reasons. This is not only a costly and sensitive problem to be managed by researchers on a domain level, but also a challenge because of the high level of data security provided by most email applications (which can be seen as a kind of DRM).

Provenance issues are particularly concerning in the case of digital offline documents. Individual handwriting, traditionally an important aid in the attribution of authorship, is no longer available, and digital analogies are not yet common. In the case of the German author Thomas Strittmatter (1961-1995), whose pre-death estate was handed over to the German Literature Archive as one of the first containing digital material, it turned out that some of the superficially indistinguishable text documents on his computer were in fact written by various lovers to whom he had given access to the machine (as told by Kramski).

Corporate IT systems, of course, also support non-content-related business processes, and this is increasingly common in small enterprises as well. The documents produced by such systems—which can also be part of the relevant digital unique artefacts—can be "non-rendered" (see above), which means that simply viewing them does not give sufficient information to the researcher.
Especially in publishing houses, these documents can be files from application systems like enterprise resource planning (ERP) systems or from systems for project management, process management, or error reporting. If these systems were not widely used at the time, or were even thoroughly proprietary to the company in question, the prospects for future research become even more complicated.
As yet, the German Literature Archive has not often been confronted with the challenge of handling unique digital artefacts. Even the simple-looking case of archiving a few floppy discs with files from the pre-death estate of Thomas Strittmatter, authored with standard software, was apparently a technological, conceptual, and organizational challenge (as told by Kramski). Some of the predictable cases mentioned above (applications, emails, "non-rendered" documents) will be much more demanding.
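For the email case, at least, there is an established interchange form: where a provider permits export to the classic mbox format, correspondence can be mined with standard tools. The sketch below, with invented addresses and file names, lists author-publisher messages by header only; message bodies would still need the privacy filtering discussed above.

```python
# Sketch of mining an exported mailbox for author-publisher correspondence,
# assuming the provider allowed export to mbox at all. Addresses and the
# mbox path are invented for illustration.
import mailbox

archive = mailbox.mbox("editor_export.mbox")
AUTHOR = "author@example.org"

for message in archive:
    involved = message.get("From", "") + message.get("To", "")
    if AUTHOR in involved:
        # Headers only; bodies remain entangled with privacy-sensitive,
        # unrelated discourses and would need careful filtering.
        print(message.get("Date"), "|", message.get("Subject"))
```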
Who, then, is responsible for ensuring that digital unique artefacts, as (mostly) unintentional traces of today's (e)book production, can be accessed in the more distant future? It is, first and foremost, the task of archives. The German Literature Archive, which at the level of its mission has accepted the transformation of its role in the digital age, has not yet implemented systematic processes, and so far has only a few authors' hard drives and floppy discs as precedent cases. Many other archives are not even that advanced.
Again, what about the publishers? In the case of Rowohlt I can say—again based on statements by Uwe Naumann—that there are no systematic provisions whatsoever. On the contrary, it seems to be normal practice to destroy material documenting publisher-author interaction as well as internal documents once the project is finished and the product has been shipped. Naumann confesses,
[a]t the end of a project, I usually delete the corresponding mail folder—including the attachments of these mails, of course: reports of content errors and software "bugs" ... Raw material such as film clips and other media elements, as well as early versions of book apps that I get from service providers on DVDs or via a file server, I store on my desktop PC. At the end of the project, I usually delete them and junk the DVDs.
Whether Naumann's unfeigned statement that "[i]f we are talking about potential world literature, I would of course not do this" can be taken as comfort here is hard to say. Moreover, this purging procedure does not of course apply to the kind of records that have to be kept as a legal obligation, such as invoices, contracts, etc., but accessing them can still imply some of the complications mentioned above: for example, they might be "non-rendered."
Some material connected to digital products, mostly for marketing purposes, is published on the Web. Beyond the fact that—in spite of the relatively comprehensive mandate of the German National Library—the Web (even just the German-language Web) can never be archived exhaustively, these Web offerings can include dynamic parts, such as active author blogs, which adds to the difficulties. There are also still, to a dwindling degree in this time of transition, physical letters, printouts of digital documents, documents of still (partly) print-based processes, print material used, for example, for sales representatives' conferences and for booksellers' advance information, and so on. This kind of material is subject to the slightly more established, if still often accidental, procedures mentioned above. For the individual authors, other artists, and service providers involved in the production of ebooks, it is not possible to generalize about long-term preservation behaviour.
As soon as we step beyond the traditional publication system, more vexing challenges arise. Self-published ebooks completely lack institutional support for long-term preservation. The German National Library archives such works at best incidentally, as a collateral effect of its Web archiving, or in rare cases in response to the initiative of authors or self-publishing platforms. Large parts of academic publishing, especially in the domains of science, technology, and medicine, are distributed to (mostly institutional) customers in the form of access to databases of journal articles and book chapters. These products do not fit the traditional, legislated mandate of the national library. Although some of the publishers pledge that access to the publications for subscribing institutions and persons will be guaranteed beyond the existence of the company, questions remain. From the perspective of the DNB and the German Research Foundation (DFG), which has taken this up as a strategic problem, the long-term availability of these academic publications is an essentially unresolved case, with consequences not only for bibliographic research on scholarly publishing as an expression of culture, but also for discourses internal to these fields concerning the history of disciplines, including the attribution of priority for findings. Studies concerning a central public hosting of these databases for the German higher education and research market, including the aspect of long-term archiving, are under way, however.
Finally, digital book forms that fall outside the prototypical value chain for trade books include works of electronic literature in the sense described by Katherine Hayles. Hayles names "hypertext fiction, network fiction, interactive fiction, locative narratives, installation pieces, 'codework,' generative art, and the Flash poem" as a few examples of what might be considered "electronic literature" (Hayles 2008, 30). These works differ in that they are typically not traded, and the rendering technology is typically part of the work of art, which makes them similar to book apps in this respect. Moreover, such publications are typically not issued an ISBN and therefore cannot even be traced systematically to confirm their existence. From the perspective of the DNB, such online material might incidentally be archived as part of the general (but selective) Web harvesting; systematically, it does not fall under the scope of the DNB's collection mandate.
Given the current state of preservation, future textual studies of today's digital book production will be challenging and may require considerable preparatory work involving data archaeology; in some cases it may simply not be possible. If current developments, especially the increasing variety of formats and particularly of formats for enhanced ebooks, prevail, as is foreseeable, and if cultural heritage governance and administration (somewhat understandably) refuse to keep up with proprietary formats and rapid update cycles, the prospects for future research on today's ebooks will remain gloomy. This is especially the case if an agreement concerning DRM-free delivery of ebook copies to central libraries cannot be reached and the silent-release problem is not tackled, for example by the compulsory use of digital signatures that disclose the identity or non-identity of files offered for download at different points in time. Beyond these particular problems, the availability and readability of today's ebooks in the future is of course a sub-problem of long-term preservation in general.
On the other hand, there are reasons for confidence that there will be beneficial disruptions to this scenario. First, standards in publishing, especially for ebook formats, are being promoted within the book industry. Second, strategies for an improved and more comprehensive long-term preservation of digital books (and book-related digital data) by national and supra-national entities are being discussed to address the deficiencies mentioned above. Developments in policy and budgeting take time, and while this delay entails the risk that some works and traces of current digital book production will sink into oblivion, the overall picture does give some reason for hope. A further reason for confidence is that we can anticipate increasing awareness of these problems among authors and, in particular, publishing houses. Since legislative measures referring to long-term preservation will most probably not oblige them to change their procedures, it is important that the corresponding concepts, formats, and so on developed in the public sector (libraries, archives, etc.) are actively propagated to publishing houses and authors by those institutions and bodies.
Bläsi, Christoph. 2012. "Verlagsarchivalische Materialien und die Geschichte des »elektronischen« Publizierens: Ein exemplarischer Vorstoß." In Verlagsgeschichtsschreibung: Modelle und Archivfunde, edited by Corinna Norrick and Ute Schneider, 233-252. Wiesbaden: Harrassowitz.
Bläsi, Christoph and Franz Rothlauf. 2013. On the interoperability of eBook formats. Brussels: European and International Booksellers Federation. http://www.europeanbooksellers.eu/wp-content/uploads/2015/02/interoperability_ebooks_formats.pdf.
Borghoff, Uwe M., Peter Rödig, Jan Scheffczyk, and Lothar Schmitz. 2003. Langzeitarchivierung: Methoden zur Erhaltung digitaler Dokumente. Heidelberg: dpunkt Verlag.
Borghoff, Uwe M., Peter Rödig, Jan Scheffczyk, and Lothar Schmitz. 2006. Long-term preservation of digital documents: Principles and practices. Heidelberg: Springer.
The Consultative Committee for Space Data Systems. 2012. Reference model for an open archival information system (OAIS). http://public.ccsds.org/publications/archive/650x0m2.pdf.
Darnton, Robert. 1982. "What is the history of the book?" Daedalus 111.3: 65-83.
Galey, Alan. 2015. "Bibliography beyond books: Digital artifacts, bibliographical methods, and the challenge of 'non-book texts'." Paper presented at SHARP 2015: The generation and regeneration of books, Montreal, Quebec, July 7-10.
Giaretta, David. 2011. Advanced digital preservation. Berlin: Springer.
Gladney, Henry M. 2010. Preserving digital information. Heidelberg: Springer.
Hayles, Katherine. 2008. Electronic literature: New horizons for the literary. Notre Dame: University of Notre Dame Press.
Howsam, Leslie. 2006. Old books and new histories: An orientation to studies in book and print culture. Toronto: University of Toronto Press.
Janello, Christoph. 2010. Wertschöpfung im digitalisierten Buchmarkt. Wiesbaden: Gabler.
Kirschenbaum, Matthew G. 2008. Mechanisms: New media and the forensic imagination. Cambridge: MIT Press.
McGann, Jerome J. 1991. The textual condition. Princeton, NJ: Princeton University Press.
Rautenberg, Ursula and Dirk Wetzel. 2001. Buch. Tübingen: Niemeyer.
Rothenberg, Jeff. 1995. "Ensuring the longevity of digital documents." Scientific American 272.1: 42-47.