Preserving Authentic Digital Journals: The Anthrosource Project

Preliminary Version

The American Anthropological Association (AAA) aims to build a digital portal which will consist mainly of journal content. Their goals are to improve access, improve the ability to retrieve anthropological resources, reduce the cost of these resources, and preserve and archive them as digital materials. They plan to achieve these goals by retrospectively converting ten of their sponsored journals and by beginning current electronic production of these same journals. The preservation and archiving element of the project is the focus of this essay, with emphasis on preservation of the authenticity of the published journals. Many preservation projects only discuss the maintenance of authenticity of the digital objects in their care; the creator is assumed to be responsible for the original establishment of authenticity. In this case, however, the American Anthropological Association is both creator and preserving entity. They must be concerned about both aspects.

There is a consensus that “a record is authentic when it is the document that it claims to be”. The practice of establishing authenticity, therefore, deals with specific claims made within or about the document rather than global ones. In other words, the question would be “did K write this?” rather than “who wrote this?” Authenticity is also not an absolute term; the authenticity of the object depends on the research being conducted. For a historian interested in the text of an analog document, an official and authorized high quality copy may be sufficient authentication; however, a book and paper researcher will only be satisfied with the analog original. Ken Thibodeau says that, “the criteria for authenticity depend on the intended use of the object. You can only say something is authentic with respect to some standard or criterion or model for what X is.” David Bearman articulates this idea in regard to digital surrogates when he states that “The issues of authenticity in digital transformations are, properly, about how well certain representations serve a desired purpose.”

Any definition of authenticity for digital objects must depend, therefore, on a decision about the desired purpose which these digital objects will fill. As AAA wishes to fill the needs of their members by serving them digital journal content, the final establishment of the authenticity of their production will depend on the needs and purposes of the members of the AAA. This may necessitate changes to the strategies employed once the results of the user study being conducted by Bonnie Nardi are available. It is possible that the AAA members may wish for different methods of representation of the journal content, as well as more, less or simply different types of authenticity documentation.

In the print world, the archival institution, or preserving repository, plays a major role in authenticity judgments; often a researcher will not question a document which is housed within a trusted repository. This is the “power of archives to authenticate”. For Peter Hirtle, this power to authenticate is derived from two basic strategies: the maintenance of unbroken provenance and diplomatic methods. The maintenance of unbroken provenance, or custody, in the analog world is based on a presumption that a certain body of records was received by the archives from a legitimate source and has been maintained in a stable form within the repository. Diplomatics, on the other hand, has an emphasis on the individual record. It aims to establish the authenticity and reliability of the records through a measurement of the completeness of the record and control over the creation of the record. There are rules and methods for determining this in the print world. These include several attributes which help to identify the record, the minimum of which are the date and the responsible party. A higher degree of completeness with more identifying attributes present increases the assumed authenticity of the document. Physical signs of authenticity embedded within the physical object also exist, including paper and ink of the appropriate era and sometimes handwriting that could indicate the authorship of a paper document.

Establishing the authenticity of digital objects is a more difficult task due to the mutability and ease of manipulation of digital objects. Since it is so easy to change a digital document without an obvious sign of the change, unbroken custody of the document no longer serves as a reliable indication of the document’s authenticity. Holding a digital object in a stable physical state is no longer an effective means of preservation either; the object must often be migrated in order to remain reproducible. Diplomatics has often run into difficulties in dealing with digital records because much of the same information that is included in analog documents, either within the document itself or through context, is not being consciously created for their digital equivalents. In their case studies, the US InterPARES Preservation Task Force discovered very little consistency in the way in which attributes which identify the record are associated and expressed, with some being embedded in the record, some encoded in metadata, and some implicit through context. This lack of consistency increases the difficulty in establishing procedures for determining authenticity. Physical signs of authenticity are also missing: the ink and paper no longer exist, and it is often impossible to glimpse the technical details of the construction of the digital document. This all necessitates a translation of the existing concept of authenticity to digital objects, including a “new infrastructure” of technical and organizational methods.

These qualities of the digital indicate the need for transparency for digital documents and the conscious creation of the level of documentation necessary to establish the document’s authenticity. This is important especially for the retrospectively converted journals, as they will be digital surrogates for print rather than born digital publications. “If sources are studied in the surrogate…all questions concerning the authenticity of the original are overlaid with additional questions about the methods of representation”. Different methods of digital capture can increase or decrease emphasis on parts of the document or represent color with differing degrees of accuracy. Knowledge of the format of the digital object may also aid in an establishment of authenticity: a TIFF may be more trusted than a JPEG, which uses lossy compression. Likewise, an open standard may be more trusted than a proprietary or encrypted standard.

For both the born digital and the retrospectively converted content, the creation of appropriate metadata will go a long way towards establishing authenticity. This includes many of the “signs” of authenticity used in diplomatics. The US InterPARES group has developed a template for “benchmark requirements supporting the presumption of authenticity of electronic records.” While these are directed at records used as legal evidence, they can be applied to the authenticity and preservation model for archiving electronic journals. The first set of requirements is defined as those that are an “expression of record attributes and linkage to record” and aim to define the identity of the record and establish the integrity of the record. The identity of the record is established through the documentation of the “persons concurring in the formation of the record”, dates of creation, action, and indications of related materials. Establishing the integrity of the record includes such items as the name of the handling office, office of responsibility, indication of annotations, and indications of technical modifications. These elements in this set of requirements could easily be adapted to the electronic journal context and should be. They should be recorded in the appropriate metadata fields in order to provide the users with information regarding these resources. Verifying that all of these elements are present would be a preliminary method for determining the authenticity of the electronic journals.
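The preliminary check described above, verifying that the identity and integrity elements are all present, might be sketched as follows. This is a minimal illustration only: the field names are hypothetical adaptations of the elements listed in the paragraph, not an InterPARES or AAA metadata schema, and an element counts as present whenever it has been recorded, even if the recorded value is an empty list (for example, an indication that there are no annotations).

```python
# Hypothetical field names adapted from the identity and integrity
# elements named in the text; they are not an official schema.
IDENTITY_FIELDS = ("persons", "creation_date", "action", "related_materials")
INTEGRITY_FIELDS = (
    "handling_office",
    "responsible_office",
    "annotations",
    "technical_modifications",
)


def missing_elements(record: dict) -> dict:
    """Return the identity and integrity elements that were never recorded."""

    def absent(names):
        # A key that is missing or set to None counts as not recorded.
        return [name for name in names if record.get(name) is None]

    return {
        "identity": absent(IDENTITY_FIELDS),
        "integrity": absent(INTEGRITY_FIELDS),
    }


def is_complete(record: dict) -> bool:
    """True when every identity and integrity element has been recorded."""
    gaps = missing_elements(record)
    return not gaps["identity"] and not gaps["integrity"]
```

A record that fails this check is not thereby inauthentic; the result only shows which elements of documentation still need to be supplied.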

Among the rest of the requirements, the delineation of the policies protecting against loss and corruption of records and media and technology obsolescence is of note. In addition, the requirement to establish an “authoritative record” is especially applicable to the AAA project. InterPARES defines this property as identifying the office responsible for maintaining the “record copy.” For AAA, however, this will entail defining which of the files maintained (out of the HTML, PDF, TIFF, or XML-DTD) will be the “archival copy.” This question is covered more fully below. Knowledge of all of these aspects of the methods of representation and maintenance, in conjunction with knowledge of their own needs, will aid researchers in deciding whether a document is authentic for their specific purposes.

Once a digital object has been established as authentic, another problem arises. This is the problem of digital preservation. The quick pace of technological change means that software and hardware used to create and display a digital document may become obsolete within three years. Since digital objects are hardware and software dependent, this creates a problem for their representation and display. In most cases, including the Anthrosource project, migration is the preservation strategy of choice. This preserves the digital document as a logically coherent entity, rather than one of physical integrity. In The State of Digital Preservation, Ken Thibodeau argues that “it is impossible to preserve a digital document as a physical object. One can only preserve the ability to reproduce the document…. The preservation of an information object in digital form is complete only when the object is successfully output.” Since this changes the document in a fundamental and physical way, this changes authenticity requirements and definitions. No longer can an archival repository simply maintain the authenticity of a document by keeping it stable and unchanged; methods of maintaining authenticity of the object while actively changing it must be established.

This brings back the concept of the “desired purpose” of the digital object. What was it produced for? What needs does it meet? What elements of the original are essential to its preservation? If the locus of authenticity inquiries centers on the rendering of the object on a computer monitor, then some standard of the “look and feel” before the migration must be maintained as a benchmark. If the locus is the intellectual content, then it must be verifiable that this has not been changed in the process of migration. Within the confines of this essay, it is assumed that the intellectual content is the most important aspect for preservation. The look and feel of the journal is important and should be preserved as long as possible, but the intellectual content is the “essential attribute” of the digital document.

The question of where the boundaries of the intellectual content of the journals lie is one which must be answered in regard to preservation choices and authenticity establishment. The report on the Harvard Mellon planning grant breaks down electronic journals into various components and defines which of these are within the preservation scope and which are not. Their model of journal components will be used here. Without a complete knowledge of the manner in which these will be published through Anthrosource, definitive statements of the components which will be preserved and authenticated cannot be made; however, general recommendations and comments can be stated. The components which are assumed to make up the “intellectual content” are as follows: the “article proper”, author supplied references, links to external sources, abstracts, tables of contents, other editorial content, bibliographic descriptions, editorial boards, editors, copyright statements, editorial policies, reviewer lists, journal descriptions, and cover images from the corresponding print issues. As AAA is serving two roles as creator and preserving entity, they can ensure that this content is produced in normative file formats which are suitable for long term preservation, a luxury not always given to digital repositories.

Some of the components described by Harvard as being within their scope are not yet part of the AAA’s publication scheme, although they may someday be and therefore should be considered for policy creation. The two components under consideration for future policy creation are “supplementary material/enhanced content” and threaded discussions. The first of these is defined as material such as sound or video files, data sets, computer files or other related materials deposited with the publisher instead of being linked to on other sites. These are author created and are not part of the “article proper”. This leads to the possibility that they will not be in normative file formats which Anthrosource is equipped to preserve indefinitely. While AAA is still in the process of transitioning from print as their primary publication, this kind of material is not likely to be an integral part of the journals. As their Internet presence becomes more established and the electronic publishing component becomes more prominent, this may become a regular part of their publication. It is for this eventuality that a policy on the place of such material within the journal should be established. Is it part of the integral intellectual content, or is it merely “related material”? Will the AAA limit the formats which they are willing to take responsibility for preserving? Will they develop a system of “levels of service” for these items similar to that of Harvard’s DRS? These same issues are applicable to the threaded discussions component. The questions surrounding these two components should be answered and policy developed prior to their advent as integral parts of the journal publications.

Since AAA is both the publisher and the archiving body of these journals, some of the information which Harvard defines as out of their scope may be within the scope of Anthrosource. These components mainly consist of business information, including information for authors about copyright transfer, subscription information, reprint ordering information, job listings, and customer service information. While these may or may not constitute the “intellectual content” of the journals, they are likely to be important business records for AAA. They also have potential to serve as important tools for research on the anthropological profession, particularly American anthropology. As such, they would deserve consideration for preservation.

How is the authenticity of these various components to be maintained? What are the standards for the production of authentic copies through migration? David M. Levy argues that every time an object is rendered on the screen, it is a “copy” of the original bitstream. Ken Thibodeau states that preservation is a process of “preserving the ability to reproduce the objects. The process of digital preservation, then, is inseparable from accessing the object.” This being said, the proof that an object has not been changed will come from its access point: if the representation is the same, then the migration was successful. For AAA’s purposes, the migration would be successful if “the message that it is meant to communicate in order to achieve its purpose is unaltered. This implies that its physical integrity… may be compromised, provided that the articulation of the content and any required annotations and elements of documentary forms remain the same”. In order to verify that the message, i.e. the intellectual content, is not changed, a benchmark is required against which the intellectual content can be checked.
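One way such a benchmark check could work for textual content is sketched below. It assumes that plain text has already been extracted from both the benchmark and the migrated file, and that differences in whitespace and in Unicode composition do not alter the message; both assumptions are illustrative policy choices rather than anything specified by AAA or InterPARES, and the function names are hypothetical.

```python
import hashlib
import re
import unicodedata


def content_fingerprint(text: str) -> str:
    """Hash the text after removing differences assumed not to affect the message."""
    # Assumption: composed and decomposed forms of a character are equivalent.
    normalized = unicodedata.normalize("NFC", text)
    # Assumption: line breaks and runs of spaces carry no intellectual content.
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def migration_preserved_content(benchmark_text: str, migrated_text: str) -> bool:
    """Compare the migrated text against the benchmark kept before migration."""
    return content_fingerprint(benchmark_text) == content_fingerprint(migrated_text)
```

A comparison of this kind speaks only to the text. Images, tables, and the look and feel of the journal would each need a benchmark of their own.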

The US InterPARES team has developed “baseline requirements supporting the production of authentic copies of electronic records.” These are split into three groups: control over records transfer, maintenance and reproduction; documentation of reproduction process and its effects; and archival description. The first group, that of control over the records, consists of documentation of unbroken custody, security measures, and any required annotations to the records. This is a synopsis of the functions which analog repositories fill in their role as authenticators of the documentation that they house. It is essentially a requirement that the archiving body fulfill its role as a “trusted repository” with its concomitant ability to authenticate the records in its possession. This role, and the documentation proving that this role has been filled, is central to establishment of the authenticity of the content held by any digital (or otherwise) repository. “Authentication of preserved objects is ultimately a matter of trust. There are ways to reduce the risk entailed by trusting someone, but ultimately, you need to trust some person, some organization, or some system or method that exercises control over the transmission of information over space, time, or technological boundaries”.

The second set of requirements, full documentation of any reproduction, echoes David Bearman and Jennifer Trant when they say that transparent methods of representation help authenticate a digital surrogate. The concepts of the surrogate and the “authentic copy” are not far removed from one another and in many cases need to be treated similarly. This second set requires the date of the reproduction and the responsible person; information regarding the identity of the record; the impact of the reproduction process on their form, content, accessibility and use; and the relationship between the original records and their reproduction. The description of the last two attributes is dependent on the “desired purpose” of the document or the “essential attributes” for preservation. The impact of the reproduction process on the records cannot be defined without some definition of the content and use. This is also true of the relationship between the original and the reproduction: how can this be stated without a definitive statement of what was important in the original? This will, once again, be dependent upon the results of the user study being conducted by Bonnie Nardi. Increased precision in this area will increase the assumption of authenticity of the copy.
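The elements in this second set could be captured in a small structured record written each time a reproduction or migration is made. The sketch below is only one possible shape: the class and field names are hypothetical, chosen to mirror the elements listed above, and are not drawn from InterPARES or from any AAA system.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class ReproductionRecord:
    """Documentation of a single reproduction; all field names are hypothetical."""

    record_id: str                 # identity of the record reproduced
    reproduced_on: date            # date of the reproduction
    responsible_person: str        # person responsible for the reproduction
    source_format: str
    target_format: str
    impact: str                    # effect on form, content, accessibility and use
    relationship_to_original: str  # how the copy relates to the original record

    def to_json(self) -> str:
        """Serialize the record so it can be stored alongside the copy."""
        data = asdict(self)
        data["reproduced_on"] = self.reproduced_on.isoformat()
        return json.dumps(data, sort_keys=True)
```

The two free-text fields are where the “desired purpose” enters: what counts as an impact worth recording depends on which attributes of the original have been judged essential.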

The third set of baseline requirements is more problematic than the first two. The US InterPARES team says that “once the records no longer exist except as authentic copies, the archival description is the primary source of information about the history of the record, that is, its various reproductions and the changes to the record that have resulted from them.” They recommend that this history of reproductions and migrations be kept in the aggregate as “archival description” rather than as a record of every reproduction of a single record. If they are performing mass migrations, they may create and maintain aggregate documentation. This does not preclude production of granular documentation, however. It also may or may not “obviate the need to preserve all the documentation for each and every reproduction.” The decision about the level of detail produced and maintained regarding these procedures is one which the AAA needs to make in regard to its own electronic record keeping system. As above, the level of documentation needed for authenticity purposes will depend on the “desired purpose” of the reproduction and the digital object as well as the record keeping system employed by AAA.

The “benchmark” standard for verifying the intellectual content after migration implies that there will be some redundancy in the storage and preservation of that intellectual content. This should be achieved by the maintenance of both presentation files and source files. The source files are the documents which are produced by the publisher before they have been presented to the consumer. Presumably, the source files contain complete intellectual content as described above. Maintaining these depends on complete knowledge of the construction or definition of the XML or SGML DTD used in production, as well as the style sheet or rendering program used. Beyond that, the semantics of the schema need to be preserved so that the “meaning of the mark up elements and how their composition turns a set of angle bracketed words and sentences into a journal that speaks from mind to mind” can be understood and hopefully reproduced. Since Atypon will be using the Archiving and Interchange XML DTD created by the National Library of Medicine for electronic journals, AAA will be able to obtain and preserve the complete documentation of the DTD. The style sheets are also important for knowledge of the manipulations performed on the source file for digital presentation. This documentation is important for preservation as well as for making the source files transparent and usable to a researcher.

The presentation files of electronic journals are often available in two different formats: HTML and PDF. While both of these formats present some challenges to their preservation, the problems surrounding PDF are greater. While Adobe does now publish the specifications for the format, it is a proprietary format which is rendered through proprietary software. If Adobe were to change management, it is possible that these specifications would no longer be published. This would leave the repository in the awkward position of having many files which it is no longer capable of migrating accurately and which it is illegal to reverse engineer. Since this is a popular format for web presentation, however, the archives may accept this into the repository with some caveats. The Harvard Steering Committee has developed guidelines for publishers who wish to archive their journals in PDF: encode them in full text rather than as page images, use standard compression, and do not encrypt them. PDF/A may become a viable alternative for this as it comes closer to being published and implemented. PDF, however, should not serve as the master archival copy. HTML is an open standard which is rendered through standard Web browsers. As such, migration of the content can always be based on the full specification with which the document was created. In addition, maintaining the style sheet used to transform the XML source file into HTML will enhance the knowledge of the document’s construction.

The University of California Press’s suggestion of archiving the PDF presentation files as TIFFs is not sufficient to ensure the preservation of the electronic journal content being created and converted. While TIFF is an open standard, a robust technology, and has commercial grade tools available in order to deal with it, the archive should handle objects that can be used in more diverse ways. “The issue of preserving tagging is a non-issue really: information tagging can be done in automated ways and indeed Atypon has done so for over 5 years; the technology for doing that is constantly improving; linking is already possible on scanned articles; preservation of tagging is not what long-term archival storage is about, i.e. it is not an essential problem.” This, obviously, is not the viewpoint adopted here. The tagging is important and should be preserved, regardless of the fact that preserving the images may be a simpler problem.

Only in the case of retrospectively converted journals should TIFFs be the “archival master.” They should be scanned as TIFFs, with the un-manipulated image file used as the archival master. A PDF, JPEG, or similar file can be generated from the TIFF for presentation on the web to serve as the use format if necessary. Documentation of the presentation file should be maintained in addition to the presentation file itself, so that a future user who may only have access to the archival master (or source file) will know if it had undergone OCR or been transformed into PDF before presentation on the web. For the born digital journals, however, the XML version contains the richest information about the content. XML, while not a perfect solution, is an open standard with many preservation qualities. If the style sheets are maintained, the look and feel of the documents will be available in addition to the rich content. This version should be the archival master, rather than the page images. The TIFF images of the PDFs may be “good to have”, and better than the proprietary PDF. They should, however, serve as the use or service copy, rather than the archival master. They will be easier to use for general research purposes than raw markup.

Another possibility for preservation of presentation files would work in conjunction with the LOCKSS program as it becomes more widely implemented. A LOCKSS server caches content to which it subscribes, creating a mirror site for the presentation files. While this would obviously not be a primary strategy, and care would have to be taken in the license writing, this may become a viable supplementary strategy for maintaining the intellectual content of the journals. This does not mean that presentation files should not be archived locally; the LOCKSS copies would serve purely as disaster planning and would not diminish the AAA’s responsibility to archive the content that they create. The archival masters stored on the Anthrosource server would serve as the authoritative and authentic copies, while the LOCKSS copies would simply ensure distributed intellectual content which may aid in long term preservation.
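The relationship between an authoritative local file and its distributed copies can be illustrated with a simple checksum audit. The sketch below is not the LOCKSS protocol, which has its own mechanisms for comparing caches; it only shows, under the assumption that the locally held file is treated as authoritative, how cached copies could be checked against it. The function names and the dictionary of copies keyed by institution are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of a file's bytes as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def audit_copies(master: bytes, copies: dict) -> list:
    """Return the names of cached copies whose bytes differ from the master.

    `copies` maps a hypothetical institution name to the bytes of its copy.
    """
    expected = sha256_of(master)
    return sorted(
        name for name, data in copies.items() if sha256_of(data) != expected
    )
```

An empty result means every cached copy matches the local file bit for bit. A non-empty result identifies which caches would need to be refreshed, or, if the local file itself is suspected of damage, which copies disagree with it.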

The problem of establishing and preserving the authenticity of digital objects is not one that has been solved definitively. There are, however, several guidelines which may be followed in order to diminish the uncertainty surrounding these issues. An overall guideline is related to the role of the trusted repository which maintains authentic records through established procedures. The work of the repository to authenticate its own records will eventually lead to a presumption of authenticity based on a journal’s presence in Anthrosource; this is an exercise in translating the “power of archives to authenticate” to the digital realm. One of the most important methods for doing this is to create very complete documentation about the ways in which these documents are created, maintained, and migrated, thus validating the trust vested in the repository.

Another guideline for establishing the authenticity of the intellectual content is to build redundancy into the system. Keeping the source files as well as the presentation files ensures the ability to verify that the text and images that were part of the intellectual content have not been altered through a migration process. This is related to the recommendation of cooperation with the LOCKSS program which would create caches of the presentation files at those institutions which both subscribe to Anthrosource and have LOCKSS servers. While there will be one authoritative “archival master” which is stored on the Anthrosource server, having multiple copies in multiple institutions will prevent the loss of the intellectual content. The final policy will need to take all of these considerations into account and be specific yet forward looking enough to be adaptable to future technological changes which can either aid or impair the preservation and authenticity of the content held in the repository.

 

Authenticity And Preservation of Digital Materials for Scholarly Use: Pathfinder and Sources

Digital Authenticity: The Issues & Some Ideas About Solutions

Bearman, David and Jennifer Trant. “Authenticity in a Digital Age.” D-Lib Magazine (June 1998) Retrieved February 13, 2004 from http://www.dlib.org/dlib/june98/06bearman.html.

This is a good resource about authenticity and its applicability to digital documents used for scholarly purposes, with a concentration on reformatted objects.

Bellinger, Meg, Laura Campbell, Margaret Hedstrom, Deanna Marcum, Kenneth Thibodeau, Donald Waters, Titia van der Werf, Colin Webb. The State of Digital Preservation: An International Perspective. Washington, D.C.: Council on Library and Information Resources, 2002. Retrieved February 10, 2004 from http://www.clir.org/pubs/reports/pub107/pub107.pdf

While this focuses on specific problems and solutions to digital preservation, there is an overriding concern with the maintenance of authenticity of the digital objects which are being preserved.

Cullen, Charles T., Peter B. Hirtle, David Levy, Clifford A. Lynch, and Jeff Rothenberg. Authenticity in a Digital Environment. Council on Library and Information Resources, 2000. Retrieved on February 12, 2004 from http://www.clir.org/pubs/reports/pub92/contents.html.

This white paper by CLIR brings together 5 different perspectives of authenticity, coming from different backgrounds. An archivist, a librarian/legal historian, the executive director of CNI, a computer scientist, and a documentarian contribute their thoughts on authenticity and the ways in which the concepts surrounding it have been changed by the advent of digital technology.

Duranti, Luciana. “Authenticity and Reliability: the Concepts and the Implications.” Archivaria 39: 1-10.

This is the canonical definition of these concepts.

Findings on the Preservation of Authentic Electronic Records: Final Report to the National Historical Publications and Records Commission. US-InterPARES Project, September 2002. Retrieved February 2, 2004 from http://www.gseis.ucla.edu/us-interpares/pdf/InterPARES1FinalReport.pdf

This includes both benchmarks for the assumption of authenticity and baseline requirements for creation of authentic copies. It is a good start for determining the kinds of metadata which need to be created strictly to establish and maintain authenticity.

Hofman, Hans. “Can Bits and Bytes be Authentic? Preserving the Authenticity of Digital Objects.” In Proceedings annual IFLA (International Federation of Library Associations and Institutions) conference, Glasgow, Scotland. Retrieved on March 20, 2004 from http://daedalus.lib.gla.ac.uk/archive/00000039/01/hofman_glasgow02.pdf

A good synthesis of the issues of authenticity and preservation of digital objects.

Olsen, Florence. “Archivists Praise PDF/A.” Federal Computer Week, March 10, 2004. Retrieved on March 19, 2004 from http://www.fcw.com/fcw/articles/2004/0308/web-pdf-03-10-04.asp.

A possible new format for digital preservation which may enable simpler authenticity maintenance.

PADI: Preserving Access to Digital Information. "Authenticity." National Library of Australia. Retrieved on March 23, 2004 from http://www.nla.gov.au/padi/topics/4.html

A great starting place for finding sources concerning authenticity and preservation of digital materials. Some of the links no longer seem to work, but in general it is a great place to locate information about this topic.

Preservation and Authenticity of Digital Scholarly Journals: the Mellon Planning Grants

“Minimum criteria for an archival repository of digital scholarly journals.” Digital Library Federation, 2000. Retrieved on March 2, 2004 from http://www.diglib.org/preserve/criteria.htm.

This is a truly minimum requirement for an archival repository of digital scholarly journals. It is a good starting point for policy building, however.

Harvard University Library Mellon Project Steering Committee, Harvard University Library Mellon Project Technical Team. “Report on the Planning Year Grant For the Design of an E-journal Archive.” Digital Library Federation, 2002. Retrieved on February 20, 2004 from http://www.diglib.org/preserve/harvardfinal.html.

The most comprehensive document, with some good guidelines as to how to deal with the complexity of digital scholarly journals.

Ockerbloom, John Mark. “Report on Mellon Funded Planning Grant For Archiving Scholarly Journals.” Digital Library Federation, 2002. Retrieved on February 24, 2004 from http://www.diglib.org/preserve/upennfinal.html.

The University of Pennsylvania's Mellon Planning Grant: they had to change many of their assumptions throughout the process of planning this. It is a good template for understanding the difficulties of thinking through the many issues involved in maintaining an archive of digital journals. They also strongly advocate keeping several versions (presentation and source files) in order to maintain the intellectual content.

Stanford University Libraries. “LOCKSS: A Distributed Digital Archiving System.” Digital Library Federation, 2002. Retrieved on February 25, 2004 from http://www.diglib.org/preserve/stanfordfinal.html.

A different perspective on digital preservation. This is an interesting avenue of research which bears further study. For more comprehensive information on the project, go to http://lockss.stanford.edu/.

The remainder of the Mellon planning grants are linked to from the Digital Library Federation's website at http://www.diglib.org/preserve/ejp.htm. Some of them are very helpful in general but not especially applicable to the AAA project at hand. For example, Cornell and the New York Public Library were both concerned with collecting journals and other materials on particular subjects, rather than by particular publishers. While this may in the end be within AAA's scope, it is not at the moment. MIT was more concerned with dynamic e-journals, which AAA will most likely not be producing in the near future, while Yale was much more concerned with intellectual property issues than AAA needs to be.