Valley of the Shadow

Overview

The University of Virginia Library is interested in supporting and collecting digital scholarship of the highest caliber and providing the same level of public and research access that is possible for library materials collected in more traditional ways. The opportunity for the Digital Curation Services group to collect, disseminate and preserve “The Valley of the Shadow” has been a successful experiment, one in which new technologies have been implemented, professional relationships across the university have been forged, and best practices documentation has been created that will provide a framework for future Sustaining Digital Scholarship (SDS) projects.

The goal of this project was for the Library to collect “The Valley of the Shadow” and disseminate it digitally through the Library’s managed content environment (LMC). Broadly, this goal would be met by completing the following:

  • Consolidate the files from three servers onto one Library server, removing duplicative files and applications
  • Update and normalize the encoding and formats of the content files
  • Update the technology used for the delivery applications, primarily searching

The Library has made every attempt to collect “The Valley of the Shadow” at a service level commensurate with the technical specification of each digital object. As a result, some objects, namely those based on the Library’s digital object and metadata creation standards, will be more easily preserved over the long term than others. Not all “The Valley of the Shadow” digital materials were created in adherence to Library standards. Throughout the project to collect and deliver “The Valley of the Shadow,” a conscious effort was made to retain the look and feel of the original archive.

Project Plan

The SDS project for “The Valley of the Shadow” consisted of two major “tracks” carried out in tandem:

I. Administrative Track: Project Planning & Documentation

Planning included timeline tracking and consultations with both “sides” of the project: the Digital Curation Services group in regard to the Library’s managed content environment (LMC) and the original VCDH project staff about existing functionality of the site. Documentation included tracking tasks and their dates of completion, as well as ensuring that all necessary documentation related to the content (provenance, relevant development history, etc.) was in place.

Project tracking was carried out in the following ways:

  • Provenance and development were tracked using spreadsheets based on documentation provided by VCDH.
  • Overall project progression was tracked using UVA’s Collab site tools. A timetable and a list of actions were created in the wiki section and periodically updated.
  • Student worker schedules were tracked using Google Calendar. Actual student work was also tracked by spreadsheets (mentioned in outline below).
  • Work on the “live” test copy of the site was tracked with an electronic ticketing system (also mentioned below). Changes to individual files were also tracked in the change logs of the Subversion version control repository.

Meetings were held about every three weeks or as needed with VCDH staff and the former Valley project manager to address administrative concerns and give project updates.

Internal Library meetings were held weekly to track progress and address issues related to planning and use of library resources.

Questions about gaps in provenance information were directed back to VCDH. They retained a graduate student worker to track down documentation related to these issues.

Final documentation for the project included an executive summary of resources used, an inventory of the redesigned site, a best practices document for future migration projects, a final list of actions taken on the project, and install notes for the redesigned site.

II. Data/Object Track: Manipulation & Management of files and Development & Implementation of Storage and Delivery Technology

The Data/Object Track had five sequential phases:

  1. Evaluation Phase
  2. File Manipulation Phase
  3. Reconstruction & Development Phase
  4. Testing Phase
  5. Delivery Phase

1. Evaluation Phase

What was done: Familiarization with and assessment of current working Valley site.

How it was done:

  • The Digital Curation Services group and students used and navigated Valley site, approaching site use from a variety of standpoints. They made notes as to functionality, ease of navigation, etc., and they paid particular attention to broken links, “loopholes” into old parts of the site, and to places where functionality could be improved.
  • Located Valley files across the multi-server environment and secured access to those files (in consultation with VCDH and library staff).
  • Relocated files to single working environment using FTP.
  • Students inventoried copied files.
  • Sorted active files from inactive files (files were not removed, but saved for archiving).
  • Files were burned to DVD without editing for archival snapshot purposes.
  • In consultation with VCDH staff, outdated elements of the site were identified for exclusion from sustained site (e.g., poor quality newspaper PDF files).

Why this was done: To establish a reference point for sustaining site functionality, to prepare site files for the move to LMC environment and to accomplish goal of file consolidation.
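
The inventory and consolidation steps above could be supported by a script along the following lines. This is a minimal sketch rather than the tooling the project used: it assumes the files from the three servers have already been copied under one local directory, and the function names are hypothetical. It groups byte-identical files by SHA-256 digest so that duplicates can be reviewed before anything is set aside for archiving.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root that have byte-identical content.

    Only groups with more than one member are returned. Nothing is
    deleted; the result is a report for a person to review.
    """
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Reporting duplicates instead of removing them matches the approach described above, in which inactive files were kept for archiving.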

2. File Manipulation Phase

What was done: Valley files were, where possible, edited to conform to existing library standards for digital material without losing essential project content. XML files were standardized to conform to the UVa. Library DTD. HTML files were converted to XHTML and brought into basic compliance with ADA guidelines. Images were evaluated for eventual rescanning in the case of UVA-owned materials.

How this was done:

  • Assessed work to be done on each type of file (XML, HTML, images) and documented this work using Excel spreadsheets and the UVA Collab site. Particular attention was paid to which edits could be automated and which would require manual work.
  • Made initial changes to XML by automation, using Unix tools and the Oxygen XML editor.
  • Manual changes to XML were made by students in Oxygen XML editor and were tracked using a spreadsheet.
  • Some changes were made to DTD in consultation with Digital Curation Services staff to accommodate key structural elements of Valley texts.
  • Additions to XML metadata were made by script; this required writing new scripts.
  • Basic conversion to XHTML was done using open-source Tidy application.
  • Student workers made remaining changes to XHTML (including ADA compliance) manually.
  • Linking throughout the site was changed from absolute to relative paths.
  • QA was performed at each step of the editing process. When edits were complete as tracked, QA scripts were run on the XML files. Errors were corrected.
  • Original copies of image materials were, where possible, located in library or VCDH holdings and rescans requested.
  • Rescanned images were cropped and renamed to match Valley originals.
  • Data tables were converted from Postgres format to XML suitable for import to Lucene/Solr.
  • Newspaper text files were edited to remove links to removed PDF files.
  • Directory structure was consolidated to remove some redundancies.
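
As an illustration of the data-table conversion mentioned above, the sketch below serializes rows into the XML update message that Solr accepts (`<add>` containing one `<doc>` per record, each with named `<field>` elements). It assumes the rows have already been exported from Postgres into Python dictionaries. The function name and the field names in the usage example are hypothetical, and this is not the project's actual conversion script.

```python
import xml.etree.ElementTree as ET


def rows_to_solr_xml(rows: list[dict[str, object]]) -> str:
    """Serialize database rows as a Solr <add> update message."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            if value is None:
                # Skip NULL columns rather than index empty fields.
                continue
            field = ET.SubElement(doc, "field", name=name)
            # ElementTree escapes &, < and > when serializing.
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical row; real field names would come from the Solr schema.
print(rows_to_solr_xml([{"id": "r1", "surname": "Smith & Sons", "age": None}]))
```

Building the message with an XML library instead of string concatenation keeps special characters in the source data from producing malformed XML.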

Why this was done: To prepare files to be effectively sustained as part of library-managed content, to be consistent in format across the project wherever possible, and to meet the goal of normalizing file formats and encoding.
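
The change from absolute to relative links is one of the edits that lends itself to automation. The following is a minimal sketch of the path arithmetic involved, not the script used on the project. The host name `valley.example.org` in the usage note is a placeholder, and the function assumes each page's own site-rooted path is known.

```python
import posixpath
from urllib.parse import urlsplit


def to_relative(href: str, page_path: str, site_host: str) -> str:
    """Rewrite an absolute same-site link relative to the page containing it.

    href      -- the link as written in the page
    page_path -- site-rooted path of the page, e.g. "/search/index.html"
    site_host -- host name of the site being migrated (an assumption here)
    """
    parts = urlsplit(href)
    if parts.netloc and parts.netloc != site_host:
        return href  # External link: leave untouched.
    if not parts.path.startswith("/"):
        return href  # Already relative, fragment-only, or mailto:.
    page_dir = posixpath.dirname(page_path)
    relative = posixpath.relpath(parts.path, start=page_dir)
    # relpath normalizes away a trailing slash; restore it for directory links.
    if parts.path.endswith("/") and not relative.endswith("/"):
        relative += "/"
    if parts.query:
        relative += "?" + parts.query
    if parts.fragment:
        relative += "#" + parts.fragment
    return relative
```

For example, `to_relative("http://valley.example.org/papers/letter1.html", "/search/index.html", "valley.example.org")` yields `../papers/letter1.html`. Relative links of this kind keep working when the site is moved to a different host or directory.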

3. Reconstruction and Development Phase

What was done: Rebuilt the site in new environment using current library technologies and prepared site components to be easily sustained over future environment and technology changes.

How this was done:

  • Valley website was configured to be delivered via Apache Cocoon, an Open Source web publishing framework. Specific pipelines and style sheets were designed for the Valley XML files, aiming to balance site consistency and current look and feel.
  • XML files of manuscript and data material were indexed to be searched using Solr/Lucene. Both the index and the Cocoon-based dynamic web pages were served via the Apache Tomcat servlet container.
  • Cocoon pipelines and XSL style sheets were created for new search pages.
  • New search pages were integrated with rest of site; old links were changed to reflect new query functionality and library management of project.
  • Images containing primary intellectual content (i.e., not title or decorative graphics) were catalogued for inclusion in library collection using the image-cataloging IRIS database.
  • Site was set up first on a local (laptop) development machine, connecting to Linux server srdev.lib for search/index functionality.
  • Upon the hiring of a new SDS programmer, the development version of the site was transferred to a virtual machine (dcdev.lib.virginia.edu) on the Library VMware server cluster for final development work and troubleshooting. The website was consolidated with the Lucene index that had been built on another development server.
  • All site content, configuration files and related material were checked into the Library’s Subversion version control repository. All subsequent changes were logged and versioned.
  • JIRA electronic ticketing system was used to track steps, prioritize and monitor fixing of bugs or site problems.
  • Address of test site was shared with original project owners and Digital Curation Services staff to begin gathering feedback.
  • JavaScript/XSLT code was refactored for over twenty search pages and their related display applications.

Why this was done: To meet goal of updating search and delivery technology; to ensure that Valley content can be easily maintained currently and in future environments still in development; to allow testing of some new library technologies; to ensure that Valley site continues to look and function more or less as it has from the user’s point of view while cleaning up “back end” structure.
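
To make the new query functionality concrete, the sketch below builds a request URL for Solr's standard `select` handler from a set of form values. On the Valley site this work was done through Cocoon pipelines and XSL style sheets, so the Python here is only an illustration of the kind of query a search page issues. The base URL and field names are assumptions, as are the function names.

```python
from urllib.parse import urlencode


def quote_phrase(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def solr_select_url(base_url: str, terms: dict[str, str], rows: int = 20) -> str:
    """Build a Solr select URL that ANDs together one clause per field.

    Empty form values are ignored; with no terms at all the query
    matches every document (*:*).
    """
    clauses = [
        f"{field}:{quote_phrase(value)}"
        for field, value in terms.items()
        if value
    ]
    query = " AND ".join(clauses) if clauses else "*:*"
    params = {"q": query, "rows": rows}
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical base URL and field names.
print(solr_select_url("http://localhost:8983/solr", {"surname": "Smith", "county": ""}))
```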

4. Testing Phase

What was done: Moved site from development server (dcdev.lib) to test server on library cluster (libsvr07.lib); documented the process and the technology necessary to make such a move; tested functionality of rebuilt site in-house; gathered feedback from original owners of project and made changes as necessary; corrected outstanding bugs and issues identified by user experience feedback; ran automated checks of all 11,094 hyperlinks in the site, redirecting, editing or omitting links where appropriate.

How this was done:

  • SDS programmer documented install procedures and observed LITS staff as they followed these instructions. Installation success was verified by comparison to development instance.
  • Notes were made concerning means of this transition: problems encountered, technology needed, steps taken, etc. Installation documentation revised accordingly, and application layout modified for improved ease of startup and maintenance.
  • Tuning of Solr/Lucene index to improve searches and overall application stability.
  • Research was undertaken as to how to automate tests for broken or malformed links.
  • Integration of Apache Tomcat with Apache Httpd server, permitting seamless delivery of static and dynamic web content.
  • Site was tested for varying user loads and server technologies tuned accordingly.
  • Research into how to secure Cocoon and Solr servlets to enhance overall website security.

Why this was done: To prepare site for live delivery; to test the “portability” of the site with an eye toward future migrations or maintenance issues; to double-check appropriateness of look and feel; to allow further QA of site functionality; to create another working copy of site, ensuring that files will not be lost in any transitions; and to increase the stability and robustness of the site, ensuring functionality across operating systems, browsers, and user experience. Ongoing documentation and change logging serve the purpose of ensuring the site is maintainable in the long term.
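
The automated link checking described in this phase can be approximated for static pages with the Python standard library alone. The sketch below is illustrative and is not the tool the project used: it assumes a local copy of the site's HTML files, checks only links that resolve to local files, and skips external URLs. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlsplit


class LinkCollector(HTMLParser):
    """Collect href and src attribute values from one HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def broken_local_links(root: Path) -> list[tuple[Path, str]]:
    """Return (page, link) pairs whose local target file does not exist."""
    broken = []
    for page in sorted(root.rglob("*.html")):
        collector = LinkCollector()
        collector.feed(page.read_text(encoding="utf-8", errors="replace"))
        for link in collector.links:
            parts = urlsplit(link)
            if parts.scheme or parts.netloc or not parts.path:
                continue  # External, mailto:, or fragment-only link.
            path = unquote(parts.path)
            if path.startswith("/"):
                target = root / path.lstrip("/")  # Site-rooted link.
            else:
                target = page.parent / path  # Link relative to the page.
            if not target.exists():
                broken.append((page, link))
    return broken
```

Each reported pair can then be handled as described above: redirected, edited, or omitted.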

5. Delivery Phase

The site has been installed on the Library web cluster. Redirects have been added to the VCDH website to point users at the new URL. The ‘site contact’ e-mail link has been forwarded to the Scholars’ Lab help page.

Digital Curation Services Project Staff:

  • Bradley Daigle – Director, Digital Curation Services
  • Lorrie Chisholm – Migration Coordinator
  • Brantley Craig – The Valley of the Shadow Migration Project Manager
  • Elizabeth Gushee – Digital Collections Librarian
  • Matthew Stephens – Sustaining Digital Scholarship Programmer

Additional assistance and project contributions from the Library came from:

Ann Burns, Dennis Collins, Andrew Curley, Rob Diethorn, Ronda Grizzle, Ethan Gruber, Kristy Haney, Ray Johnson, Leslie Johnston, Jack Kelly, Guy Mengel, Greg Murray, Perry Roland

VCDH Staff: Bill Covert, Will Davis, Scot French, Scott Nesbit, Andrew Torget

Student Staff: Jackie Aslund, Sarah Culpeper, Christopher McVey, Meredith Moore, Richard Murray, Avan Ordel, Molly O’Rourke, Su Park, Darrelynne Strother, Andrew Tarne, and Cathy Tu.