Digital Public Library of America

From Berkman Klein Google Summer of Code Wiki
Latest revision as of 09:49, 18 March 2019

This page is for an old project that is not currently part of Google Summer of Code. If you are a student looking for projects to get involved with, we suggest you check out the projects linked from the main page of this wiki.

Structured Site Scraper: From site to collection

Many local libraries, historical societies, and cultural groups have created web sites displaying collections of digitized photos, scanned documents, oral histories, audio files, etc. Frequently these local treasures are on sites designed purely with end-user browsing in mind. They would be far more useful if they were more widely searchable and browsable. The team developing the Digital Public Library of America's software platform -- a metadata server -- would like to be able to gather metadata about such sites, discovering the heritage items they point to, capturing as much of the explicit metadata as possible (captions, labels, etc.), and using the structure of the site as a heuristic for parsing the collection's structure. This metadata would then be assimilated into the appropriate schema and would be imported into the DPLA's meta-catalog. The local curators would first be shown the data as parsed so they can make corrections to the content and structure. In addition, a site map would be generated for the local curators.
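
As a rough illustration of the idea, the following minimal sketch parses a page with Python's standard html.parser module, records each image with its caption, and uses the nearest preceding heading as a guess at the collection the item belongs to. The markup convention (figure, img, figcaption), the record fields, and the sample page are all assumptions made for this example; real local sites will vary widely, and this is not the DPLA platform's actual approach.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3"}


class ItemExtractor(HTMLParser):
    """Collects one record per <figure>: image attributes, caption text,
    and the most recent heading (a structural hint for the collection)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._section = ""        # text of the most recent heading
        self._current = None      # record being built for the open <figure>
        self._in_heading = False
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in HEADINGS:
            self._section = ""
            self._in_heading = True
        elif tag == "figure":
            self._current = {"collection": self._section.strip(),
                             "src": None, "alt": None, "caption": ""}
        elif tag == "img" and self._current is not None:
            self._current["src"] = attrs.get("src")
            self._current["alt"] = attrs.get("alt")
        elif tag == "figcaption" and self._current is not None:
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False
        elif tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._current is not None:
            self._current["caption"] = self._current["caption"].strip()
            self.items.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._in_heading:
            self._section += data
        elif self._in_caption and self._current is not None:
            self._current["caption"] += data


def extract_items(html):
    """Return a list of item records found in the given HTML string."""
    parser = ItemExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page fragment, invented for this example.
SAMPLE = """
<h2>Harbor photographs</h2>
<figure><img src="/img/harbor-1910.jpg" alt="Harbor"><figcaption>Harbor view, 1910</figcaption></figure>
<figure><img src="/img/mill.jpg" alt="Mill"><figcaption> Old mill </figcaption></figure>
"""

if __name__ == "__main__":
    for item in extract_items(SAMPLE):
        print(item)
```

The records produced this way would still need review by the local curators before being mapped into a schema, as described above.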


Mentors: self@evident.com, jclark@cyber.law.harvard.edu

General Questions: berkmancenterharvard@gmail.com


Library Item Matching Service

The Digital Public Library of America software platform is gathering metadata about items in collections in libraries, museums, archives, and online cultural collections. Many of these items have identifiers in various standard namespaces such as ISBNs, OCLC identifiers, and Open Library IDs. The DPLA platform would like to offer a service through its API by which developers could submit whatever information they have about a particular item and receive any or all of the identifiers known to the DPLA. If the developer has one of the standard IDs, finding the others takes only a table lookup, although this might require accessing the APIs of other services, such as OCLC.org's. The problem becomes more difficult when the query does not include an identifier but does include other metadata such as author, title, publisher, or year. In that case the matching will be probabilistic, since records often vary in these details or are incomplete, and the query itself may contain errors or variations. This project would consist of building a useful service that takes all this into account and returns results along with a numeric expression of the system's degree of confidence in them.
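
The two cases can be sketched as follows, using only Python's standard library. Everything here is an assumption made for illustration: the in-memory catalog, the placeholder identifier values, the field weights, the threshold, and the use of difflib string similarity as a stand-in for a real probabilistic matcher. It is not the DPLA API.

```python
from difflib import SequenceMatcher

# Hypothetical catalog; the identifier values are placeholders, not real IDs.
CATALOG = [
    {"title": "Moby-Dick; or, The Whale", "author": "Herman Melville", "year": 1851,
     "ids": {"isbn": "9780000000001", "oclc": "100001", "olid": "OL0000001M"}},
    {"title": "Walden", "author": "Henry David Thoreau", "year": 1854,
     "ids": {"isbn": "9780000000002", "oclc": "100002", "olid": "OL0000002M"}},
]

# Assumed relative importance of each field when scoring a match.
WEIGHTS = {"title": 0.6, "author": 0.3, "year": 0.1}


def _norm(value):
    """Lowercase and collapse whitespace so trivial variations do not matter."""
    return " ".join(str(value).lower().split())


def _similarity(a, b):
    """String similarity in [0, 1] from difflib; a crude proxy for a match probability."""
    return SequenceMatcher(None, _norm(a), _norm(b)).ratio()


def lookup_by_id(namespace, value):
    """Exact case: given one known identifier, return all identifiers for that item."""
    for record in CATALOG:
        if record["ids"].get(namespace) == value:
            return record["ids"]
    return None


def match(query, threshold=0.5):
    """Fuzzy case: score every record against the fields present in the query.

    Returns (confidence, ids) pairs sorted from most to least confident.
    """
    fields = [f for f in WEIGHTS if query.get(f) is not None]
    if not fields:
        return []
    total = sum(WEIGHTS[f] for f in fields)
    results = []
    for record in CATALOG:
        score = sum(WEIGHTS[f] * _similarity(query[f], record[f]) for f in fields) / total
        if score >= threshold:
            results.append((round(score, 3), record["ids"]))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


if __name__ == "__main__":
    print(lookup_by_id("isbn", "9780000000001"))
    print(match({"title": "moby dick or the whale", "author": "Melville, Herman"}))
```

Weights are renormalized over the fields actually supplied, so an incomplete query can still score well on what it does provide. A production service would more likely use field-specific comparisons (for example, author-name normalization and numeric year tolerance) and a calibrated model rather than raw string similarity.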

Find more information about the Digital Public Library of America at: dp.la


Mentor: mphillips@law.harvard.edu

General Questions: berkmancenterharvard@gmail.com