Data Structures Pathfinder

 

Introduction

This pathfinder is intended to demystify the storage and data management of digital information objects as they relate to the American Anthropological Association (AAA) Anthrosource project. In addition, specific recommendations will be made based on the agreement between Atypon, University of California Press (UCP), and the AAA with regard to the preservation and storage of AAA's current and future digital objects. Recommendations will also draw on previous projects that have successfully implemented Open Archival Information System (OAIS) compliant digital repositories, such as the Cedars project, the Harvard E-journal project, and the DSpace software developed by MIT. Rather than focus on the OAIS system itself, this pathfinder will focus on the process by which digital information objects enter, are manipulated by, and are delivered by the digital repository.

Based on the Open Archival Information System model developed by the CCSDS, three data structures, called Information Packages, are processed through the digital repository: the Submission Information Package (SIP), the Archival Information Package (AIP), and the Dissemination Information Package (DIP). Six functional entities govern and manipulate the Information Packages as they are processed. The SIP is submitted to the archives through a process called Ingest. After Ingest, the SIP is converted to an AIP and is managed by the Data Management and Archival Storage components. Finally, the AIP is converted to a DIP and delivered to the consumer via the Access component. These four components are all governed by the Administration and Preservation Planning functional entities.
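The flow from SIP to AIP to DIP can be sketched in a few lines of Python. This is only an illustration of the sequence described above, not part of the OAIS specification: the class and function names (InformationPackage, ingest, store, access) are assumptions made for this example, and the content is simplified to a string.

```python
from dataclasses import dataclass, replace
from enum import Enum


class PackageType(Enum):
    SIP = "Submission Information Package"
    AIP = "Archival Information Package"
    DIP = "Dissemination Information Package"


@dataclass(frozen=True)
class InformationPackage:
    """Hypothetical stand-in for an OAIS Information Package."""
    kind: PackageType
    content: str          # the digital information object, simplified to text
    history: tuple = ()   # functional entities that have handled the package


def ingest(sip):
    """Ingest accepts a SIP and converts it to an AIP."""
    if sip.kind is not PackageType.SIP:
        raise ValueError("Ingest only accepts SIPs")
    return replace(sip, kind=PackageType.AIP, history=sip.history + ("Ingest",))


def store(aip):
    """Data Management and Archival Storage manage the AIP."""
    if aip.kind is not PackageType.AIP:
        raise ValueError("Only AIPs are stored")
    return replace(aip, history=aip.history + ("Data Management", "Archival Storage"))


def access(aip):
    """Access converts an AIP to a DIP for delivery to the consumer."""
    if aip.kind is not PackageType.AIP:
        raise ValueError("Access builds DIPs from AIPs")
    return replace(aip, kind=PackageType.DIP, history=aip.history + ("Access",))


sip = InformationPackage(PackageType.SIP, "article text")
dip = access(store(ingest(sip)))
print(dip.kind.name, dip.history)
```

Administration and Preservation Planning are left out of the sketch because they govern the other four entities rather than handle packages directly.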

Before going any further with an in-depth discussion of Information Packages, it is important to understand what happens to digital information objects before they can enter the archives.

Workflow Metadata

Before digital objects can be submitted to the archive, they must have the requisite metadata tagging, or Preservation Description Information (PDI), as the OAIS model calls it. An agreement between the producer and the digital repository, called the SIP agreement, must be in effect to guarantee that the incoming digital information object has the requisite PDI. Preservation Description Information comes in four categories: Context, Provenance, Fixity, and Reference. In addition, all files must carry technical metadata, or Representation Information (RI), which informs the computer how to interpret and display them. From the perspective of the user, RI is largely taken for granted. It is the RI embedded within most files on PCs that allows the user to simply double-click a file and have the appropriate application open and display it. RI becomes critical when the time comes to migrate digital files.

The digital information object with its requisite RI and PDI is what constitutes a Submission Information Package. For a more detailed account of what PDI metadata tags should be required by AAA for the submission of their digital objects, specifically their electronic journal articles (retrospectively converted and born-digital), see Uri's guide to AAA metadata. Here, I will only summarize the overall metadata structure of incoming SIPs.

METS, the Metadata Encoding & Transmission Standard, has been used by Harvard's E-journal project to provide a hierarchical structure for the Preservation Description Information (PDI). At the item level, i.e. individual articles, Atypon states that it will incorporate the National Library of Medicine (NLM) DTD into the production workflow system. Atypon also states that it offers HTML or PDF display. For long-term preservation, the Submission Information Package (SIP) delivered to the archives should contain metadata that links to both the PDF and HTML files. In addition, the XML tags should be encoded directly into the HTML files, or the HTML files should be converted to XML. Should separate abstract files be created, metadata should link the (likely HTML) abstract with the full-text HTML and PDF files. In the AAA proposal there is no mention of preserving linkages to digital video, audio, or images (A/V/I) embedded within articles. Atypon does mention preserving A/V/I formats, but does not go into detail. We recommend that video, audio, and images be tagged in XML during workflow (prior to ingest) and that the following standard formats be used: MPEG, AIFF, and TIFF, respectively.
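As an illustration of how one metadata record can link an abstract to the full-text HTML and PDF files, the sketch below builds a small METS fragment with Python's standard library. It uses only the METS fileSec and structMap sections; the article identifier, file IDs, and file names are hypothetical, and a production record would also carry the descriptive and administrative sections (for example the NLM DTD metadata), which are omitted here.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def m(tag):
    """Return a METS tag name in ElementTree's {namespace}tag form."""
    return f"{{{METS_NS}}}{tag}"


def build_article_mets(article_id, files):
    """Build a METS element linking every file of one article.

    files is a list of (file_id, mime_type, href) tuples.
    """
    root = ET.Element(m("mets"), {"OBJID": article_id})
    file_grp = ET.SubElement(ET.SubElement(root, m("fileSec")), m("fileGrp"))
    struct_map = ET.SubElement(root, m("structMap"))
    article_div = ET.SubElement(struct_map, m("div"), {"TYPE": "article"})
    for file_id, mime_type, href in files:
        # Declare the file and where it lives.
        file_el = ET.SubElement(file_grp, m("file"), {"ID": file_id, "MIMETYPE": mime_type})
        ET.SubElement(file_el, m("FLocat"), {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href})
        # Point the article's structural division at the file.
        ET.SubElement(article_div, m("fptr"), {"FILEID": file_id})
    return root


# Hypothetical identifiers and file names, for illustration only.
mets = build_article_mets("example-article-001", [
    ("abs-html", "text/html", "abstract.html"),
    ("full-html", "text/html", "fulltext.html"),
    ("full-pdf", "application/pdf", "fulltext.pdf"),
])
print(ET.tostring(mets, encoding="unicode"))
```

Because all three files hang off the same article division, the abstract, the HTML, and the PDF stay linked even if they are later stored separately.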

The case for XML over multi-page TIFF

Currently, Atypon proposes to convert PDF to multi-page TIFF as the master format. I do not believe that is acceptable for long-term preservation or for future access. By converting to TIFF, Atypon is only creating more work for itself down the line (assuming that at some point it is left with only the master file). If a TIFF is all Atypon has to work with in the future, it will take more resources to convert TIFF to HTML than to convert XML to HTML (or PDF). XML is also an open, published standard (a W3C Recommendation derived from SGML, which is ISO 8879), so it is no less standardized than TIFF. By using HTML files generated from XML, URIs can be maintained for any A/V/I files embedded within. This will facilitate consistent search and access in the future. As one author states,

"It is at this level of functionality (full) that preservation will occur. Although end-users currently access the journal in HTML, these pages are created on the fly from SGML. For archive purposes the archive takes the SGML files. Therefore the technical metadata (or representation information) which is required includes robust technical descriptions of the objects including information about the systems and the software necessary to run the video and sound as well as less complex technical metadata about retrieving the text and images." - Kelly Russell, "Digital Preservation and the Cedars Project Experience."

XML would also make it possible for users to search only for A/V/I files in the future. In addition, XML will allow AAA to provide services that it has not yet considered in its proposal, such as access to materials through a PDA, cell phone, e-book, or other XML-based information technology that has yet to be invented.

It should be noted that AAA asks whether Atypon can support OAI-compliant metadata harvesting protocols, but the question is never answered. Instead, Atypon dodges the question and addresses only its storage capabilities.

The SIP

As mentioned earlier, the SIP is a combination of the digital information object(s), its Representation Information (RI), and its Preservation Description Information (PDI). PDI is the name the OAIS model gives to preservation metadata. The SIP can be made up of multiple files. For example, the SIP for one article from an AAA journal may contain the HTML, the PDF, and (hopefully, for born-digital objects) the XML file of the article. In addition, the SIP would also contain the RI and PDI for all three files in one metadata file.
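The composition of such a SIP can be sketched as a pair of Python data classes. This is a minimal illustration under stated assumptions: the class names, field names, and file names are hypothetical, and the four PDI fields simply mirror the four categories named by the OAIS model.

```python
from dataclasses import dataclass, field


@dataclass
class PDI:
    """The four PDI categories named by the OAIS model."""
    context: dict = field(default_factory=dict)
    provenance: list = field(default_factory=list)
    fixity: dict = field(default_factory=dict)
    reference: dict = field(default_factory=dict)


@dataclass
class SIP:
    """One article's content files plus a single metadata record."""
    files: list                  # names of the content files
    representation_info: dict    # RI: file name -> format description
    pdi: PDI

    def files_missing_ri(self):
        """Return the content files that have no Representation Information."""
        return [name for name in self.files if name not in self.representation_info]


# Hypothetical article with HTML, PDF, and XML renditions.
sip = SIP(
    files=["article.html", "article.pdf", "article.xml"],
    representation_info={
        "article.html": "text/html",
        "article.pdf": "application/pdf",
        "article.xml": "application/xml",
    },
    pdi=PDI(context={"journal": "Example Journal"}),
)
print(sip.files_missing_ri())
```

Keeping the RI and PDI for all three renditions in one record is what lets the archive treat the article as a single unit at ingest.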

Ingest

Once a standard workflow process has been implemented and metadata encoding has been standardized throughout, the SIP can be submitted to the archives. Upon its entrance, more metadata is created and added to the SIP's PDI. Provenance information, which details the processing history of the SIP, is updated with a record of its submission, and a return receipt must be delivered to the producer. If the SIP is missing mandatory PDI metadata, a notice of resubmission should be sent instead. This Provenance information is extremely important for keeping an audit trail of submissions. Once the SIP is accepted, Fixity information such as a checksum or a Cyclic Redundancy Check (CRC) on the files is added to the PDI, and Reference information is then assigned in the form of a uniform resource identifier (URI). URIs should also be assigned to embedded A/V/I files contained within SIPs.
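The ingest steps described above can be sketched as follows. This is an illustrative outline rather than any repository's actual implementation: the mandatory field names, the base URI, and the dictionary layout of the SIP are assumptions, and SHA-256 is used as one possible checksum alongside the CRC.

```python
import hashlib
import zlib
from datetime import datetime, timezone

# Hypothetical mandatory fields; the real list would come from the SIP agreement.
MANDATORY_FIELDS = ("title", "producer")


def ingest(sip, base_uri="https://archive.example.org/item/"):
    """Validate a SIP and extend its PDI.

    sip is a dict with "id" (str), "pdi" (dict), and "files" (name -> bytes).
    """
    missing = [name for name in MANDATORY_FIELDS if name not in sip["pdi"]]
    if missing:
        # Notice of resubmission: tell the producer what is missing.
        return {"status": "resubmit", "missing": missing}

    received = datetime.now(timezone.utc).isoformat()
    pdi = dict(sip["pdi"])
    # Provenance: record the submission in the processing history.
    pdi["provenance"] = list(pdi.get("provenance", [])) + [f"accepted {received}"]
    # Fixity: a checksum and a CRC for every file.
    pdi["fixity"] = {
        name: {
            "sha256": hashlib.sha256(data).hexdigest(),
            "crc32": format(zlib.crc32(data), "08x"),
        }
        for name, data in sip["files"].items()
    }
    # Reference: assign a URI once fixity has been recorded.
    pdi["reference"] = base_uri + sip["id"]
    # The returned dict doubles as the return receipt for the producer.
    return {"status": "accepted", "receipt": received, "pdi": pdi}


sip = {
    "id": "article-0001",
    "pdi": {"title": "Sample article", "producer": "Example Press"},
    "files": {"article.pdf": b"%PDF-1.4 sample bytes"},
}
result = ingest(sip)
print(result["status"], result["pdi"]["reference"])
```

Recomputing the same checksums later and comparing them with the stored Fixity values is how the archive would detect a corrupted or altered file.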

The DSpace software created by MIT and HP automatically assigns the PDI required during ingest. The list below shows what DSpace does when an item is ingested.

* Assigns an accession date
* Adds a "date.available" value to the Dublin Core metadata record of the item
* Adds an issue date if none already present
* Adds a provenance message (including bitstream checksums)
* Assigns a Handle persistent identifier
* Adds the item to the target collection, and adds appropriate authorization policies
* Adds the new item to the search and browse indices
* (Soon) creates and archives an AIP

-- Ingest Process and Workflow, DSpace System Documentation: Functional Overview


As Harvard reported in a year-end report on their E-journal archive, ingest should take an issue-centric approach. That is, as individual article SIPs come into the archives, they would be aggregated into an issue, as they would be in printed form, to form an Archival Information Package (AIP). The METS structure allows for a three-level hierarchy, from the journal title level, to the issue level, to the item or article level. It should be noted that this hierarchical structure is only a necessity of data management and does not limit the user to searching by the METS hierarchical divisions. The METS structure (containing the NLM DTD metadata) would then be extracted and submitted to the Data Management functional entity. The actual files would be transferred to Archival Storage.
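The issue-centric aggregation can be sketched as a simple grouping step. The field names (journal, issue, article_id) and the sample values are hypothetical; each (journal, issue) entry in the result corresponds to one issue-level AIP in the three-level hierarchy.

```python
from collections import defaultdict


def build_hierarchy(article_sips):
    """Group article SIPs into a journal -> issue -> articles hierarchy."""
    hierarchy = defaultdict(lambda: defaultdict(list))
    for sip in article_sips:
        hierarchy[sip["journal"]][sip["issue"]].append(sip["article_id"])
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {journal: dict(issues) for journal, issues in hierarchy.items()}


# Hypothetical incoming article SIPs, arriving in no particular order.
incoming = [
    {"journal": "Journal A", "issue": "105(1)", "article_id": "a-001"},
    {"journal": "Journal B", "issue": "12(3)", "article_id": "b-001"},
    {"journal": "Journal A", "issue": "105(1)", "article_id": "a-002"},
    {"journal": "Journal A", "issue": "105(2)", "article_id": "a-003"},
]
hierarchy = build_hierarchy(incoming)
print(hierarchy)
```

The grouping is purely a data management convenience: the articles remain individually identifiable, so search need not follow the same divisions.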

The AIP

The Archival Information Package can be hard to pin down because it is amorphous. As above, an AIP can be an aggregate of journal articles into an issue. The OAIS model here makes the distinction that the digital information objects, i.e. the journal articles (which exist in several file formats), are separately known as Archival Information Units (AIUs). So, many AIUs make up one AIP. However, one AIP could also be an aggregate of journal issues into one journal collection under the journal title. Thus, one AIP could contain an aggregated set of other AIPs. This "collection" of AIPs into one AIP is called an Archival Information Collection (AIC). It is at this point that DSpace functionality deviates drastically from the OAIS model. In the OAIS model, the AIP and its overarching AIC are required to have descriptive information that would facilitate retrieval of either an AIP or an entire AIC. That is, the OAIS follows a traditional archives workflow. Based on this model, a user would request an AIP, and the archives staff would copy the electronic files onto a medium and then physically deliver the AIP to the user, regardless of the fact that the user may have wanted only one digital information object (one AIU, i.e. one journal article).

Because storage and data management technology has improved, and users have access to higher bandwidth, DSpace allows users to search for and retrieve individual files, making the elaborate AIP structure unnecessary for access. Instead, the AIP structure serves as a tool for better administrative management of the data rather than for better access.

In the Cedars project, the AIP structure was used to delineate levels of access or, as the project describes it, the "internal archival states of the AIP." In the Cedars project, AIPs are separated by how fully they are available for access. These distinctions, from maximum to medium to minimum availability, are determined by how high or low the demand for the digital object might be. When the time comes for AAA to begin archiving more than just electronic journals, the Cedars project may serve as a good example of how to balance processing time against user access.

Before moving on to the Dissemination Information Package (DIP), I would like to further clarify the relationship between the SIP and AIP. The OAIS model provides descriptions with some good examples of the different combinations of various SIPs and AIPs.

"The mapping between SIPs and AIPs is not one-to-one. Here are some examples:

One SIP - One AIP: A government agency is ready to archive its electronic records from the previous fiscal year. All of the year's records are placed onto magnetic tapes that are submitted as one SIP. The archive stores the tapes together as a single AIP.

Many SIPs - One AIP: A satellite sensor makes observations of the Earth over a period of one year. Every week all of the latest sensor data are submitted to the archive as a SIP. The archive has a single AIP containing all of the sensor's observations for the year. Ingest merges the Content Information from each weekly SIP into a specified file/files in Ingest persistent storage. The PDI data for the AIP is sent after the last sensor data for the year has been received. After all of the weekly SIPs and the SIP containing the PDI have arrived, Ingest processes the AIP.

One SIP - Many AIPs: A company submits financial records to an archive as one SIP. The archive chooses to store this information as two AIPs: one that contains public information and the other that contains sensitive information. This makes it easier for the archive to manage access to the information.

Many SIPs - Many AIPs: An oil and gas company collects information on its wells. Every year it submits SIPs containing all of the well status information for one well to an archive. The archive maintains one AIP for each oil or gas field and breaks out the information on each well to the proper AIP based upon its geographic coordinates.

One SIP - No AIPs: An investigator, or archive personnel, creates a new algorithm for detecting hurricanes in images. He runs this algorithm over all the images contained in an archive. This data is combined into either a new Associated Description or a set of Package Description updates which is input as a SIP."

-- 4.3.2 Data transformations in the ingest functional area, OAIS reference model (p 86-87).

The DIP

The Dissemination Information Package is a product of the Access functional entity. Access provides the user with an interface for searching the archives. Queries are relayed to the Data Management entity, which interacts with the Archival Storage entity to produce the desired AIP(s). However, AIPs may contain information that the user should not have access to, such as administrative PDI or private donor information. The requested DIP may also contain AIPs from multiple collections (AICs). The Access function must be able to process and coordinate the various restrictions and must be able to aggregate many disparate AIPs into one package.

In DSpace, the DIP would not be so much an Information Package as it would be a result set. The user would then request a DIP from a link in the result page. The requested digital information object would still need to be processed to remove restricted information from the user's view.
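A minimal sketch of the two jobs the Access function performs when building a DIP, aggregating AIPs from several collections and removing restricted information, might look like the following. The field names and the set of restricted fields are assumptions made for illustration; neither OAIS nor DSpace prescribes them.

```python
# Hypothetical names for fields the user must not see.
RESTRICTED_FIELDS = frozenset({"administrative_pdi", "donor_info"})


def build_dip(aips, restricted=RESTRICTED_FIELDS):
    """Aggregate AIPs into one DIP, dropping restricted fields from each."""
    items = [
        {key: value for key, value in aip.items() if key not in restricted}
        for aip in aips
    ]
    collections = sorted({aip["collection"] for aip in aips})
    return {"collections": collections, "items": items}


# Two hypothetical AIPs drawn from different collections (AICs).
aips = [
    {"collection": "Journal A", "title": "Article one", "donor_info": "private"},
    {"collection": "Journal B", "title": "Article two", "administrative_pdi": "internal"},
]
dip = build_dip(aips)
print(dip)
```

The stored AIPs are left untouched; only the copy delivered to the user is filtered.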

Storage

Once a process has been determined for the digital archives, a storage strategy should be assessed. Currently, Atypon states, "All content is hosted at state-of-the-art IBM co-location facility. In addition, backup data storage is also provided with DataSafe. Once a week a Data Safe representative comes to the IBM facility to pick up tapes bound for an offsite storage." In the current agreement, the above statement is the only allusion Atypon makes to its storage capabilities.

It is the recommendation of this project that AAA and Atypon iron out a more explicit storage plan. For instance, where will the archives storage servers be exactly? It is obvious that the Anthrosource portal and all information associated with it will be stored by the IBM co-location facility. That is one server holding AAA material. How many more servers need exist? We recommend at least eight separate servers to hold and provide access to AAA's digital archives.

Multiple servers are crucial for three very important reasons:

  • Faster processing time
  • Custom storage options for differing data structures
  • Less costly and less time-consuming recovery if one server fails

For web hosting, the IBM facility will likely have its own safeguards in place. The next server will serve publication workflow data. The physical location of this server will need to be determined. Will UCP host AAA's publication workflow, or will AAA host its own? For the sake of simplicity, but mostly for security, it is recommended that all publication data reside in one place with adequate provisions for data backup.

For the archives storage, there are three sections: the metadata relational database, file storage, and "dark" storage. Dark storage holds digital material that carries some sort of access restriction and therefore requires a server with greater security than that of file storage or the database.

The archives will also need to set up mirror sites, or backup servers geographically removed from the current hosting servers. Traditionally, archives have been very cooperative with each other. An option for backup may be an agreement with another repository in which they back up AAA's digital objects as long as AAA does the same for them. To find a repository willing to do this, some digital archives are getting involved with "bid trading: a mechanism where sites conduct auctions to determine who to trade with. A local site wishing to make a copy of a collection announces how much remote space is needed, and accepts bids for how much of its own space the local site must "pay" to acquire that remote space."

-- Cooper, Brian F. and Hector Garcia-Molina. "Peer-to-peer data preservation through storage auctions."
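The idea behind bid trading can be illustrated with a toy auction. This is a deliberately simplified sketch, not the algorithm from the cited paper: the function name, the bid format, the rule of accepting the lowest price, and the site names are all assumptions, and space is measured in arbitrary units.

```python
def choose_bid(space_needed, local_free_space, bids):
    """Pick the cheapest acceptable bid.

    bids maps a remote site name to (space_offered, local_space_asked):
    the remote space the site offers and the amount of the local site's
    own space it asks for in return.
    """
    acceptable = {
        site: asked
        for site, (offered, asked) in bids.items()
        if offered >= space_needed and asked <= local_free_space
    }
    if not acceptable:
        return None  # no trade is possible in this round
    winner = min(acceptable, key=acceptable.get)
    return winner, acceptable[winner]


# Hypothetical bids from three partner repositories.
bids = {
    "Repository A": (500, 700),  # asks for more space than is free locally
    "Repository B": (500, 450),
    "Repository C": (200, 100),  # offers too little space
}
print(choose_bid(space_needed=500, local_free_space=600, bids=bids))
```

In this example only Repository B both offers enough space and asks a price the local site can afford, so it wins the trade.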

In addition, it may be necessary in the future to add more servers, customized to the storage and access needs of other groups that wish to submit material to the archives. For instance, the current model is specifically designed for the storage and access of electronic journals. In the future, AAA and its members may wish to submit field notes, representations of artifacts, diaries, and even the personal papers of working or retired anthropologists.

In the next section: How do users get access to the system while keeping the system secure?