Library Innovation Lab

Three potential projects:
STRUCTURED SITE SCRAPER: From site to collection


 
Many local libraries, historical societies, and cultural groups have created web sites displaying collections of digitized photos, scanned documents, oral histories, audio files, etc. Frequently these local treasures sit on sites designed purely with end-user browsing in mind; they would be far more useful if they were more widely searchable and browsable. The team developing the Digital Public Library of America's software platform -- a metadata server -- would like to gather metadata from such sites: discovering the heritage items they point to, capturing as much of the explicit metadata as possible (captions, labels, etc.), and using the structure of the site as a heuristic for parsing the collection's structure. This metadata would then be assimilated into the appropriate schema and imported into the DPLA's meta-catalog. The local curators would first be shown the parsed data so they can correct its content and structure. In addition, a site map would be generated for the local curators.
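To make the idea concrete, here is a minimal sketch of such a scraper in Python, assuming requests and BeautifulSoup as dependencies; the record fields and crawl limits are illustrative assumptions, not the DPLA schema or platform.

<pre>
# Rough sketch: crawl one collection site, pull out item images with their
# captions, and keep URL path segments as a hint about collection structure.
# Library choices and field names are illustrative assumptions.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def scrape_site(start_url, max_pages=50):
    """Breadth-first crawl of one host, collecting item-level metadata."""
    host = urlparse(start_url).netloc
    queue, seen, records = [start_url], set(), []

    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)

        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")

        # Explicit metadata: captions, labels, page titles.
        for img in soup.find_all("img"):
            caption = img.get("alt") or ""
            figure = img.find_parent("figure")
            if figure and figure.find("figcaption"):
                caption = figure.find("figcaption").get_text(strip=True)
            records.append({
                "item_url": urljoin(url, img.get("src", "")),
                "caption": caption,
                "page_title": soup.title.get_text(strip=True) if soup.title else "",
                # URL path segments double as a structural heuristic,
                # e.g. /collections/photos/1920s/ -> nested collection.
                "structure_hint": [p for p in urlparse(url).path.split("/") if p],
            })

        # Follow same-host links only.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == host and link not in seen:
                queue.append(link)

    return records
</pre>

The "structure_hint" field is where the heuristic lives: nested URL paths (or breadcrumbs, navigation menus, and similar cues) stand in for the collection hierarchy until a curator confirms or corrects it.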
1. Syllabus parser. Design, structure and populate an open repository of the information in college syllabi. [Note that this project will be done in conjunction with the Harvard Library Innovation Lab.]
 
*Assuming we get permission, figure out how to retrieve syllabi from Google. (If we don't get permission, we have a starter set of 500,000+ syllabi.)
*Figure out how to parse the multiple, free-form formats in which syllabi are found (a rough parsing sketch follows this list).
*Design an appropriate and open data model for the information in syllabi.
*Build a Web site that provides useful end-user and API access to the syllabus data.
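As a starting point for the parsing and data-model bullets above, here is a minimal Python sketch; the field names and regular expressions are illustrative assumptions, since designing the real open data model is itself part of the project.

<pre>
# Minimal sketch of an open syllabus record plus a heuristic parser for
# free-form text. Fields and regexes are illustrative assumptions.
import re
from dataclasses import dataclass, field


@dataclass
class SyllabusRecord:
    course_code: str = ""
    title: str = ""
    instructor: str = ""
    readings: list = field(default_factory=list)


def parse_syllabus(text):
    """Pull a few common fields out of free-form syllabus text."""
    record = SyllabusRecord()

    # Course codes like "HIST 1011" or "CS50" usually appear near the top.
    m = re.search(r"\b([A-Z]{2,6})\s?-?\s?(\d{2,4}[A-Z]?)\b", text)
    if m:
        record.course_code = f"{m.group(1)} {m.group(2)}"

    # "Instructor:" / "Professor:" labels are a frequent convention.
    m = re.search(r"(?:Instructor|Professor)\s*:\s*(.+)", text, re.IGNORECASE)
    if m:
        record.instructor = m.group(1).strip()

    # Anything under a "Readings" or "Required Texts" heading, one per line.
    m = re.search(r"(?:Required Texts|Readings)\s*:?\s*\n((?:.+\n?)+)", text, re.IGNORECASE)
    if m:
        record.readings = [line.strip("-• \t") for line in m.group(1).splitlines() if line.strip()]

    return record


if __name__ == "__main__":
    sample = "HIST 1011\nInstructor: A. Smith\nRequired Texts:\n- The Historian's Craft\n- Silencing the Past"
    print(parse_syllabus(sample))
</pre>

Running it on the sample text prints a record with the course code, instructor, and reading list; real syllabi (PDF, Word, HTML) would first need text extraction and far more robust heuristics.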
 
 
2. Scholarly semantic web builder. The aim is to crawl the Google Books corpus looking for useful relationships among scholarly works. Citations and footnotes are only the starting point for such relationships; what other semantic cues can be unearthed to see how scholarly books relate? [Note that this project will be done in conjunction with the Harvard Library Innovation Lab.]
 
*Research the sorts of relations between books, beyond footnotes, that would be of high value to scholars and researchers.
*Crawl the Google Books corpus to discover these relations [if Google gives us permission].
*Make these relations accessible in an open way, especially in conjunction with the ShelfLife app, which provides community-based wayfaring through Harvard Library's holdings for scholars and researchers.
*Create interesting and understandable analytics based on the discovered relationships (a toy co-citation sketch follows this list).
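As one example of a relation beyond direct citation, and of the kind of analytic the last bullet asks for, here is a toy Python sketch of co-citation strength (how often two works are cited together by the same book); the input mapping is invented stand-in data, since the real source would be the crawled corpus.

<pre>
# Toy sketch of co-citation strength: how often two works are cited
# together by the same book. The sample data is invented for illustration.
from collections import Counter
from itertools import combinations


def co_citation_counts(citations):
    """citations maps a citing book -> set of works it cites."""
    pair_counts = Counter()
    for cited in citations.values():
        # Every unordered pair cited by the same book counts once.
        for a, b in combinations(sorted(cited), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


if __name__ == "__main__":
    sample = {
        "Book A": {"Work X", "Work Y", "Work Z"},
        "Book B": {"Work X", "Work Y"},
        "Book C": {"Work Y", "Work Z"},
    }
    for (a, b), n in co_citation_counts(sample).most_common():
        print(f"{a} <-> {b}: co-cited by {n} book(s)")
</pre>

Pairs with high counts would be candidates for "related works" edges in the open dataset.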
