Emily Dickinson Archive Data Importer Proposal

Here is our proposal for the import procedure. It is broken into six sections:

Data Needs
Data Sources
Process Overview
Field Sources
Open Questions
Notes

Nouns that correspond to models in the database are capitalized throughout.

Data Needs

The following objects would need to be imported for the first release:

Fascicles: the 40 collections of sheets
Sheets: the physical sheets of paper that make up the fascicles
Pages: the individually imaged pieces of paper (typically 4 per sheet)
Works: the independent bodies of content (multiple works per page or multiple pages per work)
Transcriptions: the machine-readable text of works
Annotations: user-contributed notes about specific portions of transcriptions
SheetScholarship: for import, we need Franklin’s suggested order of sheets in fascicles
Links: external links connected to works, pages, sheets, etc.

PageScholarship and FascicleScholarship would not necessarily need to be created as part of the import process.

Data Sources

PDS

PDS contains image URLs, Johnson numbers, Franklin numbers, sheet and sequence numbers, and work titles.

PDS must be imported by crawling the pages and parsing the HTML.

OASIS

OASIS contains fascicle numbers, sheets, pages, Johnson numbers, Franklin numbers, HOLLIS call numbers, work titles, and dates.

OASIS must be imported by crawling the page and parsing the HTML.

Franklin

Franklin contains Franklin numbers, dates, work titles, and transcriptions.

Franklin can be parsed by finding patterns in the rendering code of the PDFs.

Johnson

Johnson contains Johnson numbers and transcriptions.

Johnson could potentially be parsed by OCRing the PDFs and then finding patterns in the output, but this is questionable. It will most likely become a very manual process.

Process Overview

Automatically create the Collection object for the Harvard Library and the Scholarship objects for Franklin and Johnson.
Import Works, Transcriptions, and Annotations from the Franklin Variorum (and Johnson’s edition). At this point, we will be missing the start page and end page fields for Works.
Import Fascicles and Sheets from the OASIS finding aid. This includes loose sheets.
Import Pages and Links from PDS. Using the Franklin numbers present in the image metadata, we can fill in the start pages for Works. We should add all the page metadata to the notes field of PageScholarship so we can post-process. We can create Links from Works to the PDS record by parsing the “Cite this Resource” HTML.
Post-process our imported data to determine Franklin’s proposed sheet order within fascicles and the end pages of imported Works.
Mark Works for which we do not have Pages. This will keep them out of the web UI, which is useful because a pageless Work has no associated images to display. We don’t want to remove the data entirely, as these records could still be useful in export formats.
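
The final marking step could look something like the sketch below. It is only an illustration: ImportedWork and its start_page_id and pageless fields are hypothetical stand-ins for the real Work model, and it assumes a Work counts as pageless when the PDS import never gave it a start page.

```ruby
# Hypothetical in-memory stand-in for the Work model. The field names
# (start_page_id, pageless) are assumptions, not part of the proposal.
ImportedWork = Struct.new(:number, :start_page_id, :pageless)

# Flag every Work that never received a start page during the PDS import.
# Records are flagged rather than deleted so they stay available for export.
def mark_pageless(works)
  works.each { |work| work.pageless = work.start_page_id.nil? }
end

# The Works the web UI would be allowed to show.
def visible_works(works)
  works.reject(&:pageless)
end
```

Export code would read the full list, while the web UI would go through something like visible_works.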

Field Sources

Field	Source	Notes
Collection
Name	Automatically input	See Automatically Input
Location	Automatically input
Fascicle
Number	OASIS	See Fascicle Number
Sheet
Fascicle	OASIS	See Sheet’s Fascicle
Page
Collection ID	Automatically input
Sheet ID	Automatically input
Image URL	PDS	See Image URL
Position in Sheet	Automatically input	See Page Position in Sheet
Scholarship
Name	Automatically input
Author	Automatically input
Date	Automatically input
Work Number Prefix	Automatically input
Completeness	Automatically input
Owner	Automatically input
FascicleScholarship
Not part of import process
SheetScholarship
Sheet ID	Automatically input
Scholarship ID	Automatically input
Position in Fascicle	Franklin	See Position in Fascicle
Notes	Not part of import process
PageScholarship
Page ID	Automatically input
Scholarship ID	Automatically input
Notes	PDS	See Page Notes
Work
Title	Franklin	See Work Title
Date	OASIS or Franklin	See Work Date
Number	Franklin	See Work Number
Scholarship ID	Automatically input
Start Page ID	PDS and Franklin	See Start Pages
End Page ID	Franklin	See End Pages
Transcription
Work ID	Automatically input
Body	Franklin	See Transcription Body
Annotation
Not part of import process

Open Questions

Are fascicles subjective? The language in the finding aid suggests they are.
What are we doing with Works that exist in Franklin and Johnson but don’t exist in PDS, and vice versa (if applicable)?
Some of the sequence titles look incorrect.
Here’s an example. The first image looks like it matches the title of sequence 3, the second looks to match the fourth, the third the fifth, the fourth the sixth, the fifth the first (or second), and on and on. I really hope this isn’t a trend.
What are the ambitions around getting Amherst and BPL content in here? Importing Amherst’s records looks like it will be a very manual process.
Can we just use the dates in the OASIS finding aid, or do we want the prose description of dates contained in Franklin? Parsing Franklin is much harder.

Notes

Automatically Input

These values will be input either by the import process itself or by a program designed to input specific data, rather than data scraped from another source. The various IDs are an example of the former, while the Harvard Library Collection is an example of the latter. Entering this data automatically lets us run the import process numerous times without requiring a user to enter any data.
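
One way to make repeated runs safe is a find-or-create step for the fixed records. The sketch below is an assumption about how that might work, using a hypothetical in-memory RecordStore in place of the database; only the Harvard Library Collection and the Franklin and Johnson Scholarship objects come from this proposal.

```ruby
# Hypothetical in-memory store standing in for a database table.
class RecordStore
  def initialize
    @records = {}
    @next_id = 1
  end

  # Return the existing record for this key, or create it with a fresh ID.
  # Running the import twice therefore leaves the same records in place.
  def find_or_create(key, attributes)
    @records[key] ||= begin
      record = attributes.merge(id: @next_id)
      @next_id += 1
      record
    end
  end

  def size
    @records.size
  end
end

# Seed the records that are not scraped from any source.
def seed_fixed_records(collections, scholarships)
  collections.find_or_create("Harvard Library", name: "Harvard Library")
  scholarships.find_or_create("Franklin", name: "Franklin")
  scholarships.find_or_create("Johnson", name: "Johnson")
end
```

Calling seed_fixed_records at the start of every import run would then be harmless, since existing records are reused rather than duplicated.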

Fascicle Number

Fascicles will be drawn from the container list in OASIS for the poem collection. This will ensure that fascicles for which we do not have images will not end up in the database.

Sheet’s Fascicle

We can create new sheets and find the fascicle that owns them by parsing the finding aid in OASIS. To get the fascicle for homogeneous packets, we have headings like “Packet IX = Fascicle 33”. For heterogeneous packets, there’s a little list at the top of each packet section that indicates which sheet numbers were assigned to which fascicles by Franklin.
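
For the homogeneous case, a heading such as “Packet IX = Fascicle 33” could be parsed roughly as follows. This is a sketch that assumes every such heading follows exactly that shape, with the packet as a Roman numeral; the method and constant names are made up for illustration. The heterogeneous lists are not covered because their format is not described here.

```ruby
ROMAN_VALUES = { "I" => 1, "V" => 5, "X" => 10, "L" => 50, "C" => 100 }.freeze

# Convert a Roman numeral such as "IX" to an integer.
def roman_to_integer(numeral)
  values = numeral.upcase.chars.map { |char| ROMAN_VALUES.fetch(char) }
  signed = values.each_with_index.map do |value, index|
    following = values[index + 1]
    # A smaller value before a larger one is subtracted (the I in IX).
    following && following > value ? -value : value
  end
  signed.sum
end

# Assumed heading shape: "Packet <roman> = Fascicle <number>".
PACKET_HEADING = /\APacket\s+([IVXLC]+)\s*=\s*Fascicle\s+(\d+)\z/i

# Returns a hash with :packet and :fascicle, or nil for any other heading.
def parse_packet_heading(heading)
  match = PACKET_HEADING.match(heading.strip)
  return nil unless match

  { packet: roman_to_integer(match[1]), fascicle: match[2].to_i }
end
```

Headings that return nil would be treated as heterogeneous packets and handled by whatever parser reads the per-packet list.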

New sheets should be created for each unique number (not number-letter pair) in parentheses and bold in the container list.
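
Once the bold labels have been pulled out of the container list, collapsing them to unique sheet numbers is straightforward. The sketch below assumes the labels arrive as strings shaped like "(12a)" or "(13)"; the real OASIS markup may differ, and the names are hypothetical.

```ruby
# Assumed label shape: a number with an optional letter, in parentheses.
SHEET_LABEL = /\((\d+)[a-z]?\)/

# Returns each distinct sheet number once, in order of first appearance,
# ignoring the letter part of number-letter pairs.
def unique_sheet_numbers(labels)
  labels.map { |label| label[SHEET_LABEL, 1] }.compact.map(&:to_i).uniq
end
```

One Sheet would then be created per returned number.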

Image URL

We’ll have to talk with HLS OIS about how to handle this. They should definitely keep hosting everything, but we should talk about using their JPEG2000 viewer API so we can create various JPGs on the fly for users.

Page Position in Sheet

Page position in sheet can come from the number (not letter) in parentheses preceding the PDS sequence title. For each fascicle, there are usually sets of four images, where each set has the same number. If we just say the first in each set is a, the second is b, the third is c, and the fourth is d, that would be mostly correct. (The letters in the titles themselves can’t be used because they break at poems, not pages.)
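
A rough sketch of that lettering rule follows. It assumes each image has already been reduced to a hash with a :sequence (PDS sequence number) and a :sheet_number (the number parsed from the sequence title); both key names are invented for illustration.

```ruby
# Assign a, b, c, d... to images that share a sheet number, in sequence order.
# Returns new hashes with a :position key added; the input is not modified.
def assign_page_positions(images)
  counters = Hash.new(0)
  images.sort_by { |image| image[:sequence] }.map do |image|
    index = counters[image[:sheet_number]]
    counters[image[:sheet_number]] += 1
    image.merge(position: ("a".ord + index).chr)
  end
end
```

A set with more than four images simply continues to e, f, and so on, which makes stray images easy to spot when reviewing the output.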

There are issues with this method. There are images of slips, strings, blank pages, and other miscellaneous items in here that can throw off the count. Works that span multiple pages end up with pages that have the same metadata. We’ll just have to hardcode the exceptions.

We could also look at the Work length provided in OASIS. Combined with the sequence number, we could probably work out positions that way.

If there’s some other source of this information, that’d be sweet.

Position in Fascicle

One possible way to find Franklin’s suggested order of sheets is to get all Works with the same fascicle number and sort by Franklin number. Hopefully, sheet IDs will now be grouped (though not necessarily consecutive). We can walk through sheet ID groups and assign each a consecutive integer starting at 1. This is their Franklin order in the fascicle.
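
The walk described above might be sketched like this. It assumes each Work is available as a hash with :fascicle, :franklin_number (an integer), and :sheet_id keys, all of which are illustrative names, and it takes a sheet’s position from the first time that sheet appears in Franklin order.

```ruby
# Returns a hash mapping sheet ID to its 1-based position within one fascicle.
def franklin_sheet_order(works, fascicle)
  in_fascicle = works.select { |work| work[:fascicle] == fascicle }
  sheet_ids = in_fascicle.sort_by { |work| work[:franklin_number] }
                         .map { |work| work[:sheet_id] }
                         .uniq # keep each sheet at its first appearance
  sheet_ids.each_with_index.map { |sheet_id, index| [sheet_id, index + 1] }.to_h
end
```

If a sheet’s Works turn out not to be contiguous in Franklin order, this still yields an answer, so it would be worth logging those cases for manual review.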

Page Notes

We should dump all the metadata from PDS into this field, at least temporarily, to help post-processing tasks.

Work Title

This will typically be the first line of the work, which we should grab from Franklin. PDS and OASIS also contain titles - hopefully they just agree with Franklin’s.

Work Date

This one is tricky. OASIS contains dates that look much more standardized (and therefore easier to parse) than Franklin, but Franklin has modifiers like “summer” that OASIS doesn’t seem to have.

Work Number

This exists in Franklin (obviously) and OASIS.

Start Pages

For each work, Franklin gives us the Franklin number, and for each image, PDS provides the Franklin number. For each Work, we simply find the Page with the lowest ID that contains the Franklin number. That page becomes our start page. This assumes IDs map roughly to sequence numbers (though IDs span all fascicles), but I believe this is a safe assumption.
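
As a sketch, assuming each Page has been reduced to a hash with an :id and a :franklin_numbers array parsed from the PDS metadata (hypothetical key names):

```ruby
# Returns the lowest Page ID whose metadata mentions the Franklin number,
# or nil when no Page mentions it (a pageless Work).
def find_start_page_id(pages, franklin_number)
  matching = pages.select { |page| page[:franklin_numbers].include?(franklin_number) }
  matching.map { |page| page[:id] }.min
end
```

A nil result is a natural signal for flagging the Work as pageless later in the process.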

End Pages

There are two solutions to this.

First, the Page on which a Work ends could be inferred from the page metadata contained in PDS. The last page of a work will be the last image in a sequence that contains the work’s Franklin number in its metadata. For most works, the end page will be the same as the start page.
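
This first option could be sketched as below, again assuming Pages are hashes with hypothetical :id, :sequence, and :franklin_numbers keys, and that all the Pages passed in belong to the same PDS sequence.

```ruby
# Returns the ID of the last Page in sequence order whose metadata mentions
# the Franklin number, or nil when no Page mentions it.
def find_end_page_id(pages, franklin_number)
  matching = pages.select { |page| page[:franklin_numbers].include?(franklin_number) }
  last_page = matching.max_by { |page| page[:sequence] }
  last_page && last_page[:id]
end
```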

Alternatively, the finding aid in OASIS contains approximate lengths for each Work. We could just parse these lengths, round down, add to the start page, and we have our end page.
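
A minimal sketch of that arithmetic follows. It assumes the length appears as a decimal page count somewhere in the text (for example "1.5 p."), which may not match how OASIS actually writes it, and that Page IDs within a sequence are consecutive.

```ruby
# Returns the end Page ID, or nil when no number can be found in the text.
def end_page_id_from_length(start_page_id, length_text)
  pages = length_text[/\d+(?:\.\d+)?/]
  return nil if pages.nil?

  # Round the length down and add it to the start page, as described above.
  start_page_id + pages.to_f.floor
end
```

Under this rule a Work recorded as exactly one page long lands on the following Page, so whole-number lengths may need a special case once we see the real data.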

Transcription Body

This will be pulled from Franklin. Editorial notes about emendations and divisions and alternates, etc. will be stored in here.

We should store transcriptions as instances of a Ruby class. This can save us a step if we want to output the transcriptions as XML, JSON, or some other format: we would just fill in the proper template.
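
To illustrate the idea, here is one possible shape for such a class. Everything about it is an assumption: the class name, the work_number and lines attributes, and the XML element names are placeholders rather than a proposed schema.

```ruby
require "json"

# Hypothetical value object for one transcription.
class TranscriptionDocument
  XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

  attr_reader :work_number, :lines

  def initialize(work_number:, lines:)
    @work_number = work_number
    @lines = lines
  end

  # JSON output built from a plain hash of the attributes.
  def to_json(*args)
    { work_number: work_number, lines: lines }.to_json(*args)
  end

  # Minimal hand-built XML; the element names are illustrative only.
  def to_xml
    body = lines.map { |line| "  <line>#{escape(line)}</line>" }.join("\n")
    "<transcription work=\"#{escape(work_number)}\">\n#{body}\n</transcription>"
  end

  private

  # Escape the characters that are unsafe in XML text and attribute values.
  def escape(text)
    text.gsub(/[&<>"]/, XML_ESCAPES)
  end
end
```

Each new export format would then be one more method (or template) on the class, with the parsed transcription itself stored only once.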