This pathfinder is intended to demystify the storage and data management of digital information objects as they relate to the American Anthropological Association (AAA) Anthrosource project. In addition, it makes specific recommendations based on the agreement between Atypon, University of California Press (UCP), and the AAA with regard to the preservation and storage of AAA's current and future digital objects. Recommendations are also based on previous projects that have successfully implemented Open Archival Information System (OAIS) compliant digital repositories, such as the Cedars project, the Harvard E-journal project, and the DSpace software developed by MIT. Rather than focus on the OAIS system itself, this pathfinder focuses on the process by which digital information objects enter, are manipulated by, and are delivered by the digital repository.
In the Open Archival Information System model developed by the CCSDS, three data structures, called Information Packages, are processed through the digital repository. These are the Submission Information Package (SIP), the Archival Information Package (AIP), and the Dissemination Information Package (DIP). Six functional entities govern and manipulate the Information Packages as they are processed. The SIP is submitted to the archives through a process called Ingest. After Ingest, the SIP is converted to an AIP and is managed by the Data Management and Archival Storage components. Finally, the AIP is converted to a DIP and delivered to the consumer via the Access component. These four components are all governed by the Administration and Preservation Planning functional entities.
Before going any further with an in-depth discussion of Information Packages, it is important to understand what happens to digital information objects before they can enter the archives.
Before digital objects can be submitted to the archive, they must have the requisite metadata tagging, or Preservation Description Information (PDI) as the OAIS model calls it. An agreement between the producer and the digital repository, called the SIP agreement, must be in effect to guarantee that the incoming digital information object has the requisite PDI. Preservation Description Information comes in four categories: Context, Provenance, Fixity, and Reference. In addition, all files must carry technical metadata, or Representation Information (RI), that informs the computer how to interpret and display them. From the perspective of the user, RI is largely taken for granted. It is the RI embedded within most files on PCs that allows the user to simply double-click a file and have the appropriate application open and display it. RI becomes critical when the time comes to migrate digital files.
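To make the requirement concrete, here is a minimal sketch of the kind of check a repository might run on an incoming package. The class name, field names, and the default choice of which PDI categories the producer must supply are all assumptions made for illustration; in practice the SIP agreement would define them.

```python
from dataclasses import dataclass, field

# The four PDI categories named in the OAIS model.
PDI_CATEGORIES = ("context", "provenance", "fixity", "reference")


@dataclass
class SubmissionPackage:
    """A simplified, hypothetical SIP: content files plus their RI and PDI."""
    files: list                                              # file names in the package
    representation_info: dict = field(default_factory=dict)  # file name -> format
    pdi: dict = field(default_factory=dict)                  # PDI category -> value


def missing_metadata(sip, required=("context", "provenance")):
    """List what the producer still owes; an empty list means nothing is missing.

    `required` stands in for the terms of the SIP agreement (an assumption).
    """
    unknown = set(required) - set(PDI_CATEGORIES)
    if unknown:
        raise ValueError(f"not PDI categories: {sorted(unknown)}")
    problems = [f"PDI:{cat}" for cat in required if not sip.pdi.get(cat)]
    # Every file needs Representation Information so it can be interpreted later.
    problems += [f"RI:{name}" for name in sip.files
                 if name not in sip.representation_info]
    return problems


sip = SubmissionPackage(
    files=["article.pdf", "article.html"],
    representation_info={"article.pdf": "application/pdf"},
    pdi={"context": "one issue of an AAA journal"},
)
print(missing_metadata(sip))  # ['PDI:provenance', 'RI:article.html']
```

A package that returns a non-empty list would be sent back to the producer rather than admitted to the archives.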
The digital information object with its requisite RI and PDI is what constitutes a Submission Information Package. For a more detailed account of what PDI metadata tags AAA should require for the submission of its digital objects, specifically its electronic journal articles (retrospectively converted and born-digital), see Uri's guide to AAA metadata. Here, I will only summarize the overall metadata structure of incoming SIPs.
The Metadata Encoding & Transmission Standard (METS) has been used by Harvard's E-journal project to provide a hierarchical structure for the Preservation Description Information (PDI). At the item level, i.e. individual articles, Atypon states that it will incorporate the National Library of Medicine (NLM) DTD into the production workflow system. Atypon also states that it offers HTML or PDF display. For long-term preservation, the Submission Information Package (SIP) delivered to the archives should contain metadata that links to both the PDF and HTML files. In addition, the XML tags should be encoded directly into the HTML files, or the HTML files should be converted to XML. Should separate abstract files be created, metadata should link the (likely HTML) abstract with the full-text HTML and PDF files. The AAA proposal makes no mention of preserving linkages to digital video, audio, or images (A/V/I) embedded within articles. Atypon does mention preserving A/V/I formats, but does not go into detail. We recommend that video, audio, and images be tagged in XML during workflow (prior to ingest) and that the following standard formats be used: MPEG, AIFF, and TIFF, respectively.
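As an illustration of how one metadata file can link the several renditions of an article, the sketch below builds a bare METS skeleton with Python's standard library. The METS elements used (fileSec, fileGrp, file, FLocat, structMap, div, fptr) come from the METS schema, but the file names, identifiers, and the decision to put every rendition in one file group are assumptions, and the output has not been validated against the schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(article_id, files):
    """Build a skeletal METS record for one article.

    files is a list of (file_id, mime_type, href) tuples, one per rendition.
    """
    mets = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": article_id})
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
    article = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"TYPE": "article"})
    for file_id, mime_type, href in files:
        # Declare the file and where it lives.
        entry = ET.SubElement(group, f"{{{METS_NS}}}file",
                              {"ID": file_id, "MIMETYPE": mime_type})
        ET.SubElement(entry, f"{{{METS_NS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href})
        # Point the article's structural division at the file.
        ET.SubElement(article, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return mets


record = build_mets("example-article-1", [
    ("PDF1", "application/pdf", "article.pdf"),
    ("HTML1", "text/html", "article.html"),
    ("ABS1", "text/html", "abstract.html"),
    ("IMG1", "image/tiff", "figure1.tif"),
])
print(ET.tostring(record, encoding="unicode"))
```

Because every rendition hangs from the same structural division, the abstract, the full text, and an embedded image remain tied together however the individual files are later migrated.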
The case for XML over multi-page TIFF
Currently, Atypon proposes to convert PDF to multi-page TIFF as the master format. I do not believe that this is acceptable for long-term preservation or for future access. By converting to TIFF, Atypon is only creating more work for itself down the line (assuming that at some point it is left with only the master file). If a TIFF is all it has to work with in the future, it will take more resources to convert TIFF to HTML than to convert XML to HTML (or PDF), because a TIFF is a page image from which the text must first be recovered, whereas XML already carries the text and its structure. XML is also an openly published standard (a W3C Recommendation), so it gives up nothing to TIFF on that count. By using XML-converted HTML files, URIs can be maintained for any A/V/I files embedded within. This will facilitate consistent search and access in the future. As one author states,
It also makes it possible for users to search only for A/V/I files in
the future. In addition, XML will allow AAA to provide services that it
has not yet considered in its proposal, such as access to materials through
a PDA, cell phone, e-book, and other XML-based information technology
that has yet to be invented.
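The following sketch shows why structured markup makes an A/V/I-only search straightforward. The article markup is invented for the example: the element name `media` and the attributes `kind` and `href` are placeholders, not tags from the NLM DTD or from Atypon's workflow.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; element and attribute names are illustrative only.
ARTICLE_XML = """
<article>
  <title>Field recordings</title>
  <body>
    <p>See the site photograph <media kind="image" href="urn:example:img-1.tif"/>.</p>
    <p>Interview audio <media kind="audio" href="urn:example:aud-1.aiff"/>.</p>
  </body>
</article>
"""


def embedded_media(xml_text, kind=None):
    """Return the URIs of embedded media, optionally limited to one kind."""
    root = ET.fromstring(xml_text)
    return [m.get("href") for m in root.iter("media")
            if kind is None or m.get("kind") == kind]


print(embedded_media(ARTICLE_XML))            # every embedded A/V/I file
print(embedded_media(ARTICLE_XML, "audio"))   # audio only
```

Nothing comparable is possible against a multi-page TIFF, where an embedded figure is just more pixels on the page.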
As mentioned earlier, the SIP is a combination of the digital information object(s), its Representation Information (RI), and its Preservation Description Information (PDI). PDI is the standard name given to metadata in the OAIS model. The SIP can be made up of multiple files. For example, the SIP for one article from an AAA journal may contain the HTML, the PDF, and (hopefully, for born-digital objects) the XML file of the article. In addition, the SIP would contain the RI and PDI for all three files in one metadata file.
Once a standard workflow process has been implemented and metadata encoding has been standardized throughout, the SIP can be submitted to the archives. Upon its entrance, more metadata is created and added to the SIP's PDI. Provenance information, which details the processing history of the SIP, is updated with a record of its submission, and a return receipt must be delivered to the producer. If the SIP is missing mandatory PDI metadata, a notice requesting resubmission should be sent instead. This Provenance information is extremely important for keeping an audit trail of submissions. Once the SIP is accepted, Fixity information such as a checksum or a Cyclic Redundancy Check (CRC) on the files is added to the PDI, and Reference information is then assigned in the form of a uniform resource identifier (URI). URIs should also be assigned to embedded A/V/I files contained within SIPs.
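The ingest steps just described can be sketched in a few lines. This is an illustration under stated assumptions, not DSpace's implementation: the function names, the shape of the receipt, the rule about which PDI is mandatory, and the `urn:example:` identifier scheme are all hypothetical.

```python
import hashlib
import zlib
from datetime import datetime, timezone


def fixity(data: bytes) -> dict:
    """Compute two fixity values for one file: a SHA-256 digest and a CRC-32."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "crc32": format(zlib.crc32(data), "08x"),
    }


def ingest(sip_id, files, pdi, required=("context", "provenance")):
    """Accept or reject one SIP.

    files maps file name -> bytes. pdi is the producer-supplied PDI, in which
    "provenance" is a list of event strings. Returns either a resubmission
    notice or a receipt carrying the updated PDI.
    """
    missing = [cat for cat in required if not pdi.get(cat)]
    if missing:
        return {"status": "resubmit", "sip": sip_id, "missing": missing}

    now = datetime.now(timezone.utc).isoformat()
    updated = dict(pdi)
    # Provenance: append the submission event to the processing history.
    updated["provenance"] = list(pdi["provenance"]) + [f"{now} accepted at ingest"]
    # Fixity: one record per file, so later corruption can be detected.
    updated["fixity"] = {name: fixity(data) for name, data in files.items()}
    # Reference: a placeholder identifier scheme, not a real AAA or DSpace one.
    updated["reference"] = f"urn:example:aaa:{sip_id}"
    return {"status": "accepted", "sip": sip_id, "pdi": updated}


receipt = ingest(
    "article-0001",
    {"article.pdf": b"%PDF-1.4 example bytes"},
    {"context": "journal issue", "provenance": ["typeset by producer"]},
)
print(receipt["status"], receipt["pdi"]["reference"])
```

The returned dictionary doubles as the return receipt for the producer, and the growing provenance list is the audit trail.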
The DSpace software created by MIT and HP automatically assigns the PDI
required during ingest. Below is an example of what PDI DSpace assigns.
The Archival Information Package can be unclear at times because it is amorphous. As above, an AIP can be an aggregate of journal articles into an issue. The OAIS model here makes the distinction that the digital information objects, i.e. the journal articles (which exist in several file formats), are separately known as Archival Information Units (AIUs). So, many AIUs make up one AIP. However, one AIP could also be an aggregate of journal issues into one journal collection under the journal title. Thus, one AIP could contain an aggregate of other AIPs. This "collection" of AIPs into one AIP is called an Archival Information Collection (AIC). It is at this point that DSpace functionality deviates drastically from the OAIS model. In the OAIS model, the AIP and its overarching AIC are required to have descriptive information that facilitates retrieval of either an AIP or an entire AIC. That is, the OAIS model follows a traditional archives workflow. Based on this model, a user would request an AIP, and the archives staff would copy the electronic files onto a medium and then physically deliver the AIP to the user, regardless of the fact that the user may have wanted only one digital information object (one AIU, i.e. one journal article).
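The nesting of articles, issues, and journals can be pictured as a small tree. The sketch below is one reading of the model, treating both an issue and a journal as collections; the class names and the sample titles are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ArchivalUnit:
    """AIU: one article, held in several file formats."""
    title: str
    files: list


@dataclass
class ArchivalCollection:
    """AIC: an aggregate of AIUs and/or other collections."""
    title: str
    members: list = field(default_factory=list)

    def units(self):
        """Yield every AIU beneath this collection, however deeply nested."""
        for member in self.members:
            if isinstance(member, ArchivalCollection):
                yield from member.units()
            else:
                yield member

    def find(self, title):
        """Retrieve a single AIU by title without delivering the whole aggregate."""
        return next((u for u in self.units() if u.title == title), None)


issue = ArchivalCollection("Example Journal, issue 1", [
    ArchivalUnit("Article A", ["a.xml", "a.html", "a.pdf"]),
    ArchivalUnit("Article B", ["b.html", "b.pdf"]),
])
journal = ArchivalCollection("Example Journal", [issue])

print([unit.title for unit in journal.units()])
print(journal.find("Article B").files)
```

The `find` method reflects the item-level retrieval that a system such as DSpace offers, in contrast to handing over the whole aggregate.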
Because storage and data management technology has improved and users now have access to higher bandwidth, DSpace allows users to search for and retrieve individual files, making the elaborate AIP structure unnecessary for access. Instead, the AIP structure serves as a means of better administrative management of the data rather than as a tool for better access.
In the Cedars project, the AIP structure was used to delineate levels of access or, as the project calls them, the "internal archival states of the AIP." In the Cedars project, AIPs are separated by how fully they are available for access. From maximum to medium to minimum availability, these distinctions are determined by how high or low the demand for the digital object might be. When the time comes for AAA to begin archiving more than just electronic journals, the Cedars project may serve as a good example of how to provide the best access while balancing processing time against user access.
Before moving on to the Dissemination Information Package (DIP), I would like to further clarify the relationship between the SIP and AIP. The OAIS model describes, with some good examples, the different combinations of SIPs and AIPs.
"The mapping between SIPs and AIPs is not one-to-one. Here are some examples:
One SIP - One AIP: A government agency is ready to archive its electronic
records from the previous fiscal year. All of the year's records are placed
onto magnetic tapes that are submitted as one SIP. The archive stores
the tapes together as a single AIP."
The Dissemination Information Package is a product of the Access functional entity. Access provides the user with an interface for searching the archives. Queries are relayed to the Data Management entity, which interacts with the Archival Storage entity to produce the desired AIP(s). However, AIPs may contain information that the user should not have access to, such as Administrative PDI or private donor information. The requested DIP may also contain AIPs from multiple collections (AICs). The Access function must be able to process and coordinate the various restrictions and must be able to aggregate many disparate AIPs into one package.
In DSpace, the DIP would not be so much an Information Package as it would be a result set. The user would then request a DIP from a link in the result page. The requested digital information object would still need to be processed to remove restricted information from the user's view.
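The filtering and aggregation that Access performs might look something like the sketch below. The dictionary layout of an AIP, the `restricted` flag, and the names of the PDI fields that are hidden are assumptions made for the example and are not taken from OAIS or DSpace.

```python
# Assumed names for PDI fields that ordinary users must not see.
RESTRICTED_PDI = {"administrative", "donor"}


def build_dip(aips, may_see_restricted=False):
    """Aggregate the requested AIPs into one DIP, hiding restricted material."""
    dip = []
    for aip in aips:
        if aip.get("restricted") and not may_see_restricted:
            continue  # access-restricted objects are left out entirely
        pdi = {key: value for key, value in aip["pdi"].items()
               if may_see_restricted or key not in RESTRICTED_PDI}
        dip.append({"id": aip["id"], "collection": aip["collection"],
                    "files": list(aip["files"]), "pdi": pdi})
    return dip


requested = [
    {"id": "a1", "collection": "Journal X", "files": ["a1.pdf"],
     "pdi": {"context": "issue 1", "donor": "private"}},
    {"id": "b7", "collection": "Journal Y", "files": ["b7.html"],
     "pdi": {"context": "issue 4", "administrative": "internal note"}},
    {"id": "c3", "collection": "Journal Y", "files": ["c3.pdf"],
     "pdi": {"context": "issue 4"}, "restricted": True},
]
for item in build_dip(requested):
    print(item["id"], item["collection"], item["pdi"])
```

The two items that survive come from different collections, which is the aggregation case described above, and neither carries its donor or administrative fields.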
Once a process has been determined for the digital archives, a storage strategy should be assessed. Currently, Atypon states, "All content is hosted at state-of-the-art IBM co-location facility. In addition, backup data storage is also provided with DataSafe. Once a week a Data Safe representative comes to the IBM facility to pick up tapes bound for an offsite storage." In the current agreement, the above statement is the only allusion Atypon makes to its storage capabilities.
It is the recommendation of this project that AAA and Atypon iron out a more explicit storage plan. For instance, where will the archives storage servers be exactly? It is obvious that the Anthrosource portal and all information associated with it will be stored by the IBM co-location facility. That is one server holding AAA material. How many more servers need exist? We recommend at least eight separate servers to hold and provide access to AAA's digital archives.
Multiple servers are crucial for three very important reasons:
For web hosting, the IBM facility will likely have its own safeguards in place. The next server will serve publication workflow data. The physical location of this server will need to be determined. Will UCP host AAA's publication workflow, or will AAA host its own? For the sake of simplicity, but mostly for security, it is recommended that all publication data reside in one place with adequate provisions for data backup.
For the archives storage, there are three sections: the metadata relational database, file storage, and "dark" storage. "Dark" storage holds digital material that carries some sort of access restriction and therefore requires a server with greater security than that of file storage or the database.
The archives will also need to set up mirror sites, or backup servers
geographically removed from the current hosting servers. Traditionally,
archives have been very cooperative with each other. An option for backup
may be an agreement with another repository in which they back up AAA's
digital objects as long as AAA does the same for them. To find a repository
willing to do this, some digital archives are getting involved with "bid
trading: a mechanism where sites conduct auctions to determine who
to trade with. A local site wishing to make a copy of a collection announces
how much remote space is needed, and accepts bids for how
In addition, it may be necessary in the future to offer more servers, customized to the storage and access needs of various other groups that wish to submit material to the archives. For instance, the current model is specifically designed for the storage and access of electronic journals. In the future, AAA and its members may wish to submit field notes, representations of artifacts, diaries, and even the personal papers of working or retired anthropologists.
It may also be necessary to keep one server permanently offline that backs up all of the AAA content, for added protection of both the content and its authenticity. Only administrators would have access to this server.
Below is a diagram of the minimum number of servers AAA should employ to ensure safe, efficient, and secure access to and storage of all content.
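One way an offline copy can protect authenticity is by serving as the trusted reference in a periodic audit. The sketch below compares digest manifests from two copies; the function names and the in-memory file dictionaries are assumptions made for illustration, and a real audit would read the files from each server.

```python
import hashlib


def manifest(files):
    """Map each file name to the SHA-256 digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def audit(reference_copy, online_copy):
    """Compare two manifests, treating reference_copy as the trusted one."""
    shared = set(reference_copy) & set(online_copy)
    return {
        "missing": sorted(set(reference_copy) - set(online_copy)),
        "extra": sorted(set(online_copy) - set(reference_copy)),
        "changed": sorted(n for n in shared if reference_copy[n] != online_copy[n]),
    }


offline = manifest({"a.pdf": b"original", "b.pdf": b"second article"})
online = manifest({"a.pdf": b"tampered", "c.pdf": b"unexpected file"})
print(audit(offline, online))
```

Any entry under "changed" signals that an online file no longer matches the protected copy and should be investigated or restored.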
In the next section: How do users get access to the system while keeping the system secure?