Emily Dickinson Archive Data Importer Proposal

Here is our proposal for the import procedure. It is broken into 6 sections:

  1. Data Needs
  2. Data Sources
  3. Process Overview
  4. Field Sources
  5. Open Questions
  6. Notes

Nouns that correspond to models in the database are capitalized throughout.

Data Needs

The following objects would need to be imported for the first release:

Fascicles
  the 40 collections of sheets
Sheets
  the physical sheets of paper that make up the fascicles
Pages
  the individually imaged pieces of paper (typically 4 per sheet)
Works
  the independent bodies of content (multiple works per page or multiple pages per work)
Transcriptions
  the machine-readable text of works
Annotations
  user-contributed notes about specific portions of transcriptions
SheetScholarship
  for import, we need Franklin’s suggested order of sheets in fascicles
Links
  external links connected to works, pages, sheets, etc.

PageScholarship and FascicleScholarship would not necessarily need to be created as part of the import process.

Data Sources

PDS

PDS contains image URLs, Johnson numbers, Franklin numbers, sheet and sequence numbers, and work titles.

PDS must be imported by crawling the pages and parsing the HTML.

OASIS

OASIS contains fascicle numbers, sheets, pages, Johnson numbers, Franklin numbers, HOLLIS call numbers, work titles, and dates.

OASIS must be imported by crawling the page and parsing the HTML.

Franklin

Franklin contains Franklin numbers, dates, work titles, and transcriptions.

Franklin can be parsed by finding patterns in the rendering code of the PDFs.

Johnson

Johnson contains Johnson numbers and transcriptions.

Johnson could potentially be parsed by OCRing the PDFs and then finding patterns in the output, but the reliability of this approach is doubtful. It will most likely become a largely manual process.

Process Overview

  1. Automatically create the Collection object for the Harvard Library and the Scholarship objects for Franklin and Johnson.

  2. Import Works, Transcriptions, and Annotations from the Franklin Variorum (and Johnson’s edition). At this point, we will be missing the start page and end page fields for Works.

  3. Import Fascicles and Sheets from the OASIS finding aid. This includes loose sheets.

  4. Import Pages and Links from PDS. Using the Franklin numbers present in the image metadata, we can fill in the start pages for Works. We should add all the page metadata to the notes field of PageScholarship so we can post-process. We can create Links from Works to the PDS record by parsing the “Cite this Resource” HTML.

  5. Post-process our imported data to determine Franklin’s proposed sheet order within fascicles and the end pages of imported Works.

  6. Mark Works for which we do not have Pages. Pageless Works have no associated images, so marking them lets us keep them out of the web UI. We don’t want to remove the data entirely, as these records could still be useful in export formats.

Field Sources

Field                   Source                      Notes

Collection
  Name                  Automatically input         See Automatically Input
  Location              Automatically input
Fascicle
  Number                OASIS                       See Fascicle Number
Sheet
  Fascicle              OASIS                       See Sheet’s Fascicle
Page
  Collection ID         Automatically input
  Sheet ID              Automatically input
  Image URL             PDS                         See Image URL
  Position in Sheet     Automatically input         See Page Position in Sheet
Scholarship
  Name                  Automatically input
  Author                Automatically input
  Date                  Automatically input
  Work Number Prefix    Automatically input
  Completeness          Automatically input
  Owner                 Automatically input
FascicleScholarship     Not part of import process
SheetScholarship
  Sheet ID              Automatically input
  Scholarship ID        Automatically input
  Position in Fascicle  Franklin                    See Position in Fascicle
  Notes                 Not part of import process
PageScholarship
  Page ID               Automatically input
  Scholarship ID        Automatically input
  Notes                 PDS                         See Page Notes
Work
  Title                 Franklin                    See Work Title
  Date                  OASIS or Franklin           See Work Date
  Number                Franklin                    See Work Number
  Scholarship ID        Automatically input
  Start Page ID         PDS and Franklin            See Start Pages
  End Page ID           Franklin                    See End Pages
Transcription
  Work ID               Automatically input
  Body                  Franklin                    See Transcription Body
Annotation              Not part of import process

Open Questions

Notes

Automatically Input

These values will be input either by the import process itself or by a program designed to input specific data, rather than being scraped from another source. The various IDs are an example of the former, while the Harvard Library Collection is an example of the latter. Entering this data automatically allows us to run the import process repeatedly without requiring a user to enter any data.

Fascicle Number

Fascicles will be drawn from the container list in OASIS for the poem collection. This will ensure that fascicles for which we do not have images will not end up in the database.

Sheet’s Fascicle

We can create new sheets and find the fascicle that owns them by parsing the finding aid in OASIS. To get the fascicle for homogeneous packets, we have headings like “Packet IX = Fascicle 33”. For heterogeneous packets, there’s a little list at the top of each packet section that indicates which sheet numbers were assigned to which fascicles by Franklin.

New sheets should be created for each unique number (not number-letter pair) in parentheses and bold in the container list.
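
As a rough sketch of the parsing described above, the following Ruby shows one way to read a homogeneous packet heading and to collect unique sheet numbers. The helper names are hypothetical, and the label format `(12a)` is an assumption about how the parenthesized numbers look once extracted from the finding aid HTML; the real markup may differ.

```ruby
# Matches headings such as "Packet IX = Fascicle 33".
PACKET_HEADING = /\APacket\s+([IVXLC]+)\s*=\s*Fascicle\s+(\d+)\z/

ROMAN_VALUES = { "I" => 1, "V" => 5, "X" => 10, "L" => 50, "C" => 100 }.freeze

# Converts a Roman numeral to an integer using the subtractive rule:
# a symbol smaller than its right-hand neighbor is subtracted.
def roman_to_integer(numeral)
  values = numeral.chars.map { |char| ROMAN_VALUES.fetch(char) }
  values.each_with_index.sum do |value, index|
    following = values[index + 1]
    following && following > value ? -value : value
  end
end

# Returns { packet:, fascicle: } for a homogeneous packet heading,
# or nil when the heading does not follow that pattern.
def parse_packet_heading(heading)
  match = PACKET_HEADING.match(heading.strip)
  return nil unless match

  { packet: roman_to_integer(match[1]), fascicle: match[2].to_i }
end

# Collects unique sheet numbers from labels such as "(12a)" (assumed
# format), ignoring the trailing letter.
def unique_sheet_numbers(labels)
  labels.map { |label| label[/\((\d+)[a-z]?\)/, 1] }.compact.map(&:to_i).uniq
end
```

Headings that do not match return nil, so heterogeneous packets can fall through to the separate list-based handling.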

Image URL

We’ll have to talk with HLS OIS about how to handle this. We should definitely keep them hosting everything, but we should talk about utilizing their JPEG2000 viewer API so we can create various JPGs on the fly for users.

Page Position in Sheet

Page position in sheet can come from the number (not letter) in parentheses preceding the PDS sequence title. For each fascicle, there are usually sets of four images, where each set has the same number. If we just say the first in each set is a, the second is b, the third is c, and the fourth is d, that would be mostly correct. (The letters in PDS cannot be used directly because they break at poems, not pages.)
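
The lettering scheme could be sketched as follows. `PageImage`, `assign_positions`, and the `skip_sequences` parameter are hypothetical names introduced here for illustration; the sketch assumes the sheet number has already been parsed out of each PDS sequence title.

```ruby
# One imaged page: its PDS sequence number and the sheet number parsed
# from the parentheses in its sequence title.
PageImage = Struct.new(:sequence, :sheet_number)

# Walks the images in sequence order and letters them a, b, c, d, ...
# within each sheet number. Sequences listed in skip_sequences are left
# out of the count entirely.
def assign_positions(images, skip_sequences: [])
  counters = Hash.new(0)
  images.sort_by(&:sequence).each_with_object({}) do |image, positions|
    next if skip_sequences.include?(image.sequence)

    index = counters[image.sheet_number]
    counters[image.sheet_number] += 1
    positions[image.sequence] = ("a".ord + index).chr
  end
end
```

The result maps each sequence number to its letter. A hardcoded list of exceptions can be passed as `skip_sequences`.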

There are issues with this method. There are images of slips, strings, blank pages, and other miscellaneous items that can throw off the count. Works that span multiple pages end up with pages that have the same metadata. We’ll just have to hardcode the exceptions.

We could also look at the Work length provided in OASIS. Combined with the sequence number, we could probably work out positions that way.

If there’s some other source of this information, that’d be sweet.

Position in Fascicle

One possible way to find Franklin’s suggested order of sheets is to get all Works with the same fascicle number and sort by Franklin number. Hopefully, sheet IDs will now be grouped (though not necessarily consecutive). We can walk through sheet ID groups and assign each a consecutive integer starting at 1. This is their Franklin order in the fascicle.
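
A minimal sketch of that walk, assuming each Work record already carries its fascicle number and the sheet ID of its start page (the `WorkRecord` struct and `franklin_sheet_order` are hypothetical names, not existing models):

```ruby
# Simplified stand-in for a Work joined to the Sheet of its start page.
WorkRecord = Struct.new(:franklin_number, :fascicle_number, :sheet_id)

# Sorts the Works of one fascicle by Franklin number and numbers each
# sheet ID in order of first appearance, starting at 1.
def franklin_sheet_order(works, fascicle_number)
  order = {}
  works.select { |work| work.fascicle_number == fascicle_number }
       .sort_by(&:franklin_number)
       .each { |work| order[work.sheet_id] ||= order.size + 1 }
  order
end
```

The returned hash maps sheet ID to its position in the fascicle. Because a sheet is numbered on first appearance, a sheet ID that reappears later in the sort keeps its original position, which is one way to tolerate groups that are not perfectly contiguous.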

Page Notes

We should dump all the metadata from PDS into this field, at least temporarily, to help post-processing tasks.

Work Title

This will typically be the first line of the work, which we should grab from Franklin. PDS and OASIS also contain titles; hopefully they agree with Franklin’s.

Work Date

This one is tricky. OASIS contains dates that look much more standardized (and therefore easier to parse) than Franklin, but Franklin has modifiers like “summer” that OASIS doesn’t seem to have.

Work Number

This exists in Franklin (obviously) and OASIS.

Start Pages

For each work, Franklin gives us the Franklin number, and for each image, PDS provides the Franklin number. For each Work, we simply find the Page with the lowest ID that contains the Franklin number. That page becomes our start page. This assumes IDs map roughly to sequence numbers (though IDs span all fascicles), but I believe this is a safe assumption.
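
In code, the lookup might look like the sketch below. `PageRecord` and `start_page_id` are hypothetical names, and the sketch assumes the Franklin numbers found in each image's PDS metadata have already been parsed into an array.

```ruby
# Simplified stand-in for a Page plus the Franklin numbers found in its
# PDS image metadata.
PageRecord = Struct.new(:id, :franklin_numbers)

# Returns the lowest Page ID whose metadata mentions the Franklin number,
# or nil when no Page mentions it.
def start_page_id(pages, franklin_number)
  pages.select { |page| page.franklin_numbers.include?(franklin_number) }
       .map(&:id)
       .min
end
```

A nil result identifies a Work with no Pages, which is the case the final step of the process overview needs to mark.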

End Pages

There are two solutions to this.

First, the Page on which a Work ends could be inferred from the page metadata contained in PDS. The last page of a work will be the last image in a sequence that contains the work’s Franklin number in its metadata. For most works, the end page will be the same as the start page.

Alternatively, the finding aid in OASIS contains approximate lengths for each Work. We could just parse these lengths, round down, add to the start page, and we have our end page.
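
Both approaches can be sketched briefly. All names here are hypothetical. The second function works on an ordered list of Page IDs rather than adding directly to an ID, which is an assumption made so that gaps in the IDs do not matter, and it clamps the result to the last known Page.

```ruby
# Simplified stand-in for a Page plus the Franklin numbers found in its
# PDS image metadata.
ImagedPage = Struct.new(:id, :franklin_numbers)

# First approach: the end page is the highest Page ID whose metadata
# mentions the Work's Franklin number (nil when no Page mentions it).
def end_page_from_metadata(pages, franklin_number)
  pages.select { |page| page.franklin_numbers.include?(franklin_number) }
       .map(&:id)
       .max
end

# Second approach: round the OASIS length down and step that many Pages
# forward from the start page, staying within the known Pages.
def end_page_from_length(ordered_page_ids, start_page_id, length_in_pages)
  start_index = ordered_page_ids.index(start_page_id)
  return nil unless start_index

  end_index = [start_index + length_in_pages.floor, ordered_page_ids.size - 1].min
  ordered_page_ids[end_index]
end
```

Running both and comparing the answers would be a cheap way to flag Works that need manual review. Note that, taken literally, the second rule places a Work of exactly one page on the page after its start page, so whole-number lengths may need an adjustment.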

Transcription Body

This will be pulled from Franklin. Editorial notes about emendations, divisions, alternates, etc. will also be stored here.

We should store transcriptions as instances of a Ruby class. This can save us a step if we want to output the transcriptions as XML, JSON, or some other format: we will just fill in the proper template.
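
To illustrate the idea, here is a minimal sketch of such a class. The class name, its two attributes, and the XML element names are all assumptions made for the example; the real structure would depend on what we can extract from Franklin.

```ruby
require "json"

# Hypothetical container for a parsed transcription: the lines of the
# work plus any editorial notes.
class TranscriptionBody
  XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

  attr_reader :lines, :editorial_notes

  def initialize(lines:, editorial_notes: [])
    @lines = lines
    @editorial_notes = editorial_notes
  end

  # Serializes the transcription as a JSON object.
  def to_json(*args)
    { lines: lines, editorial_notes: editorial_notes }.to_json(*args)
  end

  # Fills in a very small XML template, escaping markup characters.
  def to_xml
    body = lines.map { |line| "  <line>#{escape(line)}</line>" } +
           editorial_notes.map { |note| "  <note>#{escape(note)}</note>" }
    (["<transcription>"] + body + ["</transcription>"]).join("\n")
  end

  private

  def escape(text)
    text.gsub(/[&<>]/, XML_ESCAPES)
  end
end
```

Adding another export format then means adding one more method to the class rather than re-parsing the source text.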