Here is our proposal for the import procedure. It is broken into 6 sections:
Nouns that correspond to models in the database are capitalized throughout.
The following objects would need to be imported for the first release:
PageScholarship and FascicleScholarship would not necessarily need to be created as part of the import process.
PDS contains image URLs, Johnson numbers, Franklin numbers, sheet and sequence numbers, and work titles.
PDS must be imported by crawling the pages and parsing the HTML.
OASIS contains fascicle numbers, sheets, pages, Johnson numbers, Franklin numbers, HOLLIS call numbers, work titles, and dates.
OASIS must be imported by crawling the page and parsing the HTML.
Franklin contains Franklin numbers, dates, work titles, and transcriptions.
Franklin can be parsed by finding patterns in the rendering code of the PDFs.
Johnson contains Johnson numbers and transcriptions.
Johnson could potentially be parsed by OCRing the PDFs and then finding patterns in the output, but it is doubtful that this will be reliable. It will most likely become a very manual process.
Automatically create the Collection object for the Harvard Library and the Scholarship objects for Franklin and Johnson.
Import Works, Transcriptions, and Annotations from the Franklin Variorum (and Johnson’s edition). At this point, we will be missing the start page and end page fields for Works.
Import Fascicles and Sheets from the OASIS finding aid. This includes loose sheets.
Import Pages and Links from PDS. Using the Franklin numbers present in the image metadata, we can fill in the start pages for Works. We should add all the page metadata to the notes field of PageScholarship so we can post-process. We can create Links from Works to the PDS record by parsing the “Cite this Resource” HTML.
Post-process our imported data to determine Franklin’s proposed sheet order within fascicles and the end pages of imported Works.
Mark Works for which we do not have Pages. This will keep those Works out of the web UI, which is what we want, because a Work without Pages has no associated images to show. We don’t want to remove the data entirely, as these records could still be useful in export formats.
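As a rough illustration of the last step, the marking could be as simple as setting a flag. This is only a sketch: the `ImportedWork` struct and its `pageless` field are hypothetical stand-ins for whatever the Work model ends up looking like.

```ruby
# Hypothetical stand-in for the Work model; `pageless` is an assumed flag.
ImportedWork = Struct.new(:number, :start_page_id, :pageless)

# Flags every Work that has no start Page instead of deleting it,
# so the record stays available to export formats.
def mark_pageless(works)
  works.each { |work| work.pageless = work.start_page_id.nil? }
end

# The web UI would then list only Works that are not flagged.
def visible_in_ui(works)
  works.reject(&:pageless)
end
```

Keeping the flag on the record, rather than filtering at import time, means the export code can still see every Work.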
| Field | Source | Notes |
|---|---|---|
| Collection | ||
| Name | Automatically input | See Automatically Input |
| Location | Automatically input | |
| Fascicle | ||
| Number | OASIS | See Fascicle Number |
| Sheet | ||
| Fascicle | OASIS | See Sheet’s Fascicle |
| Page | ||
| Collection ID | Automatically input | |
| Sheet ID | Automatically input | |
| Image URL | PDS | See Image URL |
| Position in Sheet | Automatically input | See Page Position in Sheet |
| Scholarship | ||
| Name | Automatically input | |
| Author | Automatically input | |
| Date | Automatically input | |
| Work Number Prefix | Automatically input | |
| Completeness | Automatically input | |
| Owner | Automatically input | |
| FascicleScholarship | ||
| Not part of import process | ||
| SheetScholarship | ||
| Sheet ID | Automatically input | |
| Scholarship ID | Automatically input | |
| Position in Fascicle | Franklin | See Position in Fascicle |
| Notes | Not part of import process | |
| PageScholarship | ||
| Page ID | Automatically input | |
| Scholarship ID | Automatically input | |
| Notes | PDS | See Page Notes |
| Work | ||
| Title | Franklin | See Work Title |
| Date | OASIS or Franklin | See Work Date |
| Number | Franklin | See Work Number |
| Scholarship ID | Automatically input | |
| Start Page ID | PDS and Franklin | See Start Pages |
| End Page ID | Franklin | See End Pages |
| Transcription | ||
| Work ID | Automatically input | |
| Body | Franklin | See Transcription Body |
| Annotation | ||
| Not part of import process | ||
Are fascicles subjective? The language in the finding aid suggests they are.
What are we doing with Works that exist in Franklin and Johnson but don’t exist in PDS, and vice versa (if applicable)?
Some of the sequence titles look incorrect.
Here’s an example.
The first image looks like it matches the title of sequence 3,
the second looks to match the fourth, the third the fifth, the fourth the
sixth, the fifth the first (or second), and on and on. I really hope this
isn’t a trend.
What are the ambitions around getting Amherst and BPL content in here? Importing Amherst’s records looks like it will be a very manual process.
Can we just use the dates in the OASIS finding aid, or do we want the prose description of dates contained in Franklin? Parsing Franklin is much harder.
These values will either be input by the import process itself, or by a program that is designed to input specific data, rather than data scraped from another source. The various IDs are an example of the former, while the Harvard Library Collection is an example of the latter. Entering this data automatically allows us to run the import process numerous times without requiring a user to enter any data.
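To make repeated runs safe, the seeding program could look records up before creating them. The sketch below uses an in-memory `Store` class as a stand-in for the database; `Store`, `find_or_create`, and `seed` are hypothetical names, and a real import would go through the application's models instead.

```ruby
# In-memory stand-in for the database (an assumption for this sketch).
class Store
  def initialize
    @rows = Hash.new { |hash, table| hash[table] = [] }
  end

  # Returns the existing row matching `attrs`, or inserts a new one.
  # Running the import twice therefore creates no duplicates.
  def find_or_create(table, attrs)
    existing = @rows[table].find { |row| attrs.all? { |key, value| row[key] == value } }
    return existing if existing

    row = attrs.merge(id: @rows[table].size + 1)
    @rows[table] << row
    row
  end

  def count(table)
    @rows[table].size
  end
end

# Creates the records that are input automatically rather than scraped.
def seed(store)
  store.find_or_create(:collections, name: "Harvard Library")
  store.find_or_create(:scholarships, name: "Franklin")
  store.find_or_create(:scholarships, name: "Johnson")
end
```

Calling `seed` a second time on the same store returns the existing rows and leaves the counts unchanged.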
Fascicles will be drawn from the container list in OASIS for the poem collection. This will ensure that fascicles for which we do not have images will not end up in the database.
We can create new sheets and find the fascicle that owns them by parsing the finding aid in OASIS. To get the fascicle for homogeneous packets, we have headings like “Packet IX = Fascicle 33”. For heterogeneous packets, there’s a little list at the top of each packet section that indicates which sheet numbers were assigned to which fascicles by Franklin.
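For the homogeneous case, the heading quoted above is regular enough for a pattern match. A minimal sketch, assuming every such heading follows the “Packet IX = Fascicle 33” shape exactly; the constant and method names are made up for illustration.

```ruby
# Matches headings of the form "Packet IX = Fascicle 33".
PACKET_HEADING = /\APacket\s+([IVXLCDM]+)\s*=\s*Fascicle\s+(\d+)\z/

# Returns the packet numeral and fascicle number for a homogeneous packet
# heading, or nil for anything else (heterogeneous packets need the
# per-sheet list at the top of the packet section instead).
def parse_packet_heading(text)
  match = PACKET_HEADING.match(text.strip)
  return nil unless match

  { packet: match[1], fascicle: match[2].to_i }
end
```

A `nil` result is the cue to fall back to the sheet-to-fascicle list for that packet.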
New sheets should be created for each unique number (not number-letter pair) in parentheses and bold in the container list.
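A sketch of that rule, assuming the labels appear in the text as a number followed by an optional lowercase letter, such as `(12a)`. That label shape is a guess, and detecting bold would depend on the markup OASIS actually emits, so it is left out here.

```ruby
# Assumed label shape: "(12a)", i.e. a number plus an optional letter.
SHEET_LABEL = /\((\d+)[a-z]?\)/

# Returns the distinct sheet numbers in first-seen order, ignoring letters.
def sheet_numbers(container_list_text)
  container_list_text.scan(SHEET_LABEL).flatten.map(&:to_i).uniq
end
```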
We’ll have to talk with HLS OIS about how to handle this. We should definitely keep them hosting everything, but we should talk about utilizing their JPEG2000 viewer API so we can create various JPGs on the fly for users.
Page position in sheet can come from the number (not letter) in parentheses preceding the PDS sequence title. For each fascicle, there are usually sets of four images, where each set has the same number. If we just say the first in each set is a, the second is b, third is c, fourth is d, that would be mostly correct. (The letters given in PDS can't be used directly because they break at poems, not pages.)
There are issues with this method. There are images of slips and strings and blank pages and other random stuff in here that can throw off the count. Works that span multiple pages end up with pages that have the same metadata. We’ll just have to hardcode the exceptions.
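The lettering rule, together with a place to put the hardcoded exceptions, might look like the sketch below. `Image`, `SKIPPED_SEQUENCES`, and `page_positions` are hypothetical names; the skip list is empty because the real exceptions would have to be collected by hand.

```ruby
# Hypothetical image record: the PDS sequence number plus the sheet number
# taken from the parenthesised label in the sequence title.
Image = Struct.new(:sequence, :sheet_number)

# Sequence numbers to ignore (slips, strings, blank pages).
SKIPPED_SEQUENCES = [].freeze

LETTERS = ("a".."z").to_a.freeze

# Returns { sequence => letter }, lettering the images of each sheet
# a, b, c, d ... in sequence order.
def page_positions(images, skipped = SKIPPED_SEQUENCES)
  kept = images.reject { |image| skipped.include?(image.sequence) }
  kept.group_by(&:sheet_number).each_with_object({}) do |(_sheet, group), positions|
    group.sort_by(&:sequence).each_with_index do |image, index|
      positions[image.sequence] = LETTERS[index]
    end
  end
end
```

Skipped images are dropped before grouping, so they do not shift the letters of the pages that follow them.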
We could also look at the Work length provided in OASIS. Combined with the sequence number, we could probably work out positions that way.
If there’s some other source of this information, that’d be sweet.
One possible way to find Franklin’s suggested order of sheets is to get all Works with the same fascicle number and sort by Franklin number. Hopefully, sheet IDs will now be grouped (though not necessarily consecutive). We can walk through sheet ID groups and assign each a consecutive integer starting at 1. This is their Franklin order in the fascicle.
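Sketched in code, for the Works of a single fascicle. `WorkOnSheet` is a hypothetical record pairing a Work's Franklin number with the Sheet its start page sits on.

```ruby
# Hypothetical record: a Work's Franklin number and the ID of the Sheet
# that holds its start page.
WorkOnSheet = Struct.new(:franklin_number, :sheet_id)

# Returns { sheet_id => position } for one fascicle, positions starting at 1.
# A sheet gets its number the first time it appears in Franklin-number order.
def franklin_sheet_order(works)
  order = {}
  works.sort_by(&:franklin_number).each do |work|
    order[work.sheet_id] ||= order.size + 1
  end
  order
end
```

Because a sheet is only numbered on first appearance, this version still produces an order if the sheet IDs turn out not to be perfectly grouped, though such cases would deserve a manual look.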
We should dump all the metadata from PDS into this field, at least temporarily, to help post-processing tasks.
This will typically be the first line of the Work, which we should grab from Franklin. PDS and OASIS also contain titles; hopefully they agree with Franklin’s.
This one is tricky. OASIS contains dates that look much more standardized (and therefore easier to parse) than Franklin, but Franklin has modifiers like “summer” that OASIS doesn’t seem to have.
This exists in Franklin (obviously) and OASIS.
For each Work, Franklin gives us the Franklin number, and for each image, PDS provides the Franklin number. For each Work, we simply find the Page with the lowest ID whose metadata contains that Franklin number. That Page becomes our start page. This assumes IDs map roughly to sequence numbers (though IDs span all fascicles), but I believe this is a safe assumption.
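A minimal sketch of that lookup. `PageRecord` is a hypothetical record holding a Page ID and the Franklin numbers parsed from its PDS metadata.

```ruby
# Hypothetical record: a Page ID and the Franklin numbers found in the
# PDS metadata for that image.
PageRecord = Struct.new(:id, :franklin_numbers)

# Returns the lowest Page ID whose metadata mentions the Franklin number,
# or nil when no image exists for the Work.
def start_page_id(franklin_number, pages)
  pages.select { |page| page.franklin_numbers.include?(franklin_number) }
       .map(&:id)
       .min
end
```

A `nil` result identifies exactly the Works that have no Pages.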
There are two solutions to this.
First, the Page on which a Work ends could be inferred from the page metadata contained in PDS. The last page of a Work will be the last image in a sequence that contains the Work’s Franklin number in its metadata. For most Works, the end page will be the same as the start page.
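One reading of this first option is to walk forward from the start page for as long as consecutive pages still carry the Franklin number. The sketch below takes that reading, which is an interpretation rather than a settled rule; `PageMeta` is a hypothetical record.

```ruby
# Hypothetical record: a Page ID and the Franklin numbers found in the
# PDS metadata for that image.
PageMeta = Struct.new(:id, :franklin_numbers)

# Finds the first page carrying the Franklin number, follows the unbroken
# run of pages that also carry it, and returns the ID of the last one.
def end_page_id(franklin_number, pages)
  ordered = pages.sort_by(&:id)
  start = ordered.index { |page| page.franklin_numbers.include?(franklin_number) }
  return nil unless start

  run = ordered.drop(start).take_while { |page| page.franklin_numbers.include?(franklin_number) }
  run.last.id
end
```

Stopping at the first gap keeps a stray later mention of the same number from stretching the Work across unrelated pages.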
Alternatively, the finding aid in OASIS contains approximate lengths for each Work.
We could just parse these lengths, round down, add to the start page, and we
have our end page.
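The arithmetic for this second option is sketched below. It assumes the length has already been parsed into a number of pages and that we have the fascicle's Page IDs in sequence order; both inputs, and the method name, are assumptions.

```ruby
# Applies the rule as stated: end index = start index + floor(length).
# `page_ids` is the fascicle's Page IDs in sequence order (assumed input).
def end_page_from_length(start_page_id, length_in_pages, page_ids)
  start_index = page_ids.index(start_page_id)
  return nil unless start_index

  end_index = start_index + length_in_pages.floor
  # Clamp so an over-long estimate cannot run past the last page.
  page_ids[[end_index, page_ids.size - 1].min]
end
```

Note that a Work whose length is a whole number of pages and that begins at the top of a page would land one page too far under this rule, so the result is worth checking against the metadata-based approach.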
This will be pulled from Franklin. Editorial notes about emendations and divisions and alternates, etc. will be stored in here.
We should store transcriptions as instances of a Ruby class. This can save us a step if we want to output the transcriptions as XML, JSON, or something else: we will just fill in the proper template.
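A sketch of what such a class could look like. The fields follow the Transcription rows of the field table (Work ID, Body); the XML element names are placeholders rather than a chosen schema, and string interpolation stands in for a real template.

```ruby
require "json"

# Hypothetical transcription class with one method per output format.
class Transcription
  XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

  attr_reader :work_id, :body

  def initialize(work_id:, body:)
    @work_id = work_id
    @body = body
  end

  def to_h
    { work_id: work_id, body: body }
  end

  def to_json(*args)
    to_h.to_json(*args)
  end

  # Element names are placeholders for whatever template is chosen.
  def to_xml
    escaped = body.gsub(/[&<>"]/, XML_ESCAPES)
    %(<transcription work-id="#{work_id}"><body>#{escaped}</body></transcription>)
  end
end
```

Adding another export format would then mean adding one more method, without touching the stored data.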