Digital Public Library of America

Structured Site Scraper: From site to collection

Many local libraries, historical societies, and cultural groups have created websites displaying collections of digitized photos, scanned documents, oral histories, audio files, and similar materials. Frequently these local treasures sit on sites designed purely for end-user browsing. They would be far more useful if they were more widely searchable and browsable. The team developing the Digital Public Library of America's software platform (a metadata server) would like to gather metadata about such sites: discovering the heritage items they point to, capturing as much of the explicit metadata as possible (captions, labels, etc.), and using the structure of the site as a heuristic for inferring the collection's structure. This metadata would then be mapped into the appropriate schema and imported into the DPLA's meta-catalog. The local curators would first be shown the data as parsed so they can correct its content and structure. In addition, a site map would be generated for the local curators.
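To make the idea concrete, here is a minimal sketch of the extraction step using only the Python standard library. It is an illustration under stated assumptions, not the DPLA platform's design: it assumes items appear as img elements, that a figcaption following an image is its caption, and that the most recent h1 or h2 heading hints at the collection an item belongs to. The names ItemExtractor and extract_items and the record fields are hypothetical.

```python
from html.parser import HTMLParser


class ItemExtractor(HTMLParser):
    """Collect one record per <img>, with explicit metadata found nearby."""

    def __init__(self):
        super().__init__()
        self.items = []            # extracted records, in document order
        self._heading = None       # text of the most recent <h1>/<h2>
        self._in_heading = False   # True while inside a heading
        self._in_caption = False   # True while inside a <figcaption>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("h1", "h2"):
            self._in_heading = True
            self._heading = ""
        elif tag == "img" and attrs.get("src"):
            self.items.append({
                "url": attrs["src"],
                "label": attrs.get("alt") or "",
                "caption": "",
                # Assumption: site structure (headings) mirrors collection structure.
                "collection": self._heading,
            })
        elif tag == "figcaption":
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self._in_heading = False
            self._heading = self._heading.strip()
        elif tag == "figcaption":
            self._in_caption = False

    def handle_data(self, data):
        if self._in_heading:
            self._heading += data
        elif self._in_caption and self.items:
            # Assumption: a caption describes the image that precedes it.
            self.items[-1]["caption"] += data


def extract_items(html):
    """Return a list of item records parsed from one HTML page."""
    parser = ItemExtractor()
    parser.feed(html)
    parser.close()
    for item in parser.items:
        item["caption"] = item["caption"].strip()
    return parser.items
```

Records like these are what a curator could review and correct before anything is mapped to a schema. Real sites vary widely, so the heading and caption heuristics would likely need to be configurable per site.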

Mentors: self@evident.com, jclark@cyber.law.harvard.edu

General Questions: berkmancenterharvard@gmail.com

Library Item Matching Service

The Digital Public Library of America software platform is gathering metadata about items in the collections of libraries, museums, archives, and online cultural collections. Many of these items have identifiers in various standard namespaces, such as ISBNs, OCLC identifiers, and Open Library IDs. The DPLA platform would like to offer a service through its API by which developers could submit whatever information they have about a particular item and get back any or all of the identifiers known to the DPLA. If the developer has one of the standard IDs, finding the others is just a table lookup, although this might require calling the APIs of other services, such as OCLC.org's. The problem becomes more difficult when the query does not include an identifier but does include other metadata such as author, title, publisher, or year. The matching is then probabilistic, because records often vary in these details or are incomplete, and the query itself may contain errors or variations. This project would consist of building a useful service that takes all of this into account and returns results along with a numeric expression of the system's degree of confidence in them.
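The two cases described above, exact lookup by identifier and fuzzy matching on descriptive metadata, can be sketched as follows. This is a rough illustration, not the DPLA API: the catalog is a hypothetical in-memory list, the identifiers are made-up placeholders, the field weights are arbitrary, and the confidence value is a weighted string-similarity score from Python's difflib rather than a calibrated probability.

```python
from difflib import SequenceMatcher

# Hypothetical catalog with placeholder identifiers (not real ISBN/OCLC/OLID values).
CATALOG = [
    {"title": "Moby-Dick; or, The Whale", "author": "Herman Melville", "year": "1851",
     "ids": {"isbn": "isbn-0001", "oclc": "oclc-0001", "olid": "olid-0001"}},
    {"title": "Walden", "author": "Henry David Thoreau", "year": "1854",
     "ids": {"isbn": "isbn-0002", "oclc": "oclc-0002", "olid": "olid-0002"}},
]

# Assumed relative importance of each field; these would need tuning on real data.
WEIGHTS = {"title": 0.6, "author": 0.3, "year": 0.1}


def _normalize(text):
    """Lowercase and collapse whitespace so trivial variations do not matter."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


def lookup_by_id(namespace, value):
    """Exact case: one known identifier yields all the others, or None."""
    for record in CATALOG:
        if record["ids"].get(namespace) == value:
            return record["ids"]
    return None


def match(query, threshold=0.5):
    """Fuzzy case: score each record on the fields present in the query."""
    fields = [f for f in WEIGHTS if query.get(f)]
    if not fields:
        return []
    total = sum(WEIGHTS[f] for f in fields)
    results = []
    for record in CATALOG:
        score = sum(WEIGHTS[f] * similarity(query[f], record[f]) for f in fields) / total
        if score >= threshold:
            results.append({"ids": record["ids"], "confidence": round(score, 3)})
    # Best candidates first.
    return sorted(results, key=lambda r: r["confidence"], reverse=True)
```

A caller might try lookup_by_id first and fall back to match when no identifier is available. Renormalizing by the weights of the fields actually supplied keeps scores comparable between sparse and complete queries, which is one possible design choice among several.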

Find more information about the Digital Public Library of America at: dp.la

Mentor: mphillips@law.harvard.edu

General Questions: berkmancenterharvard@gmail.com
