Library Innovation Lab

About

The Library Innovation Lab is a small group within the Harvard University Library system that implements in software ideas about how libraries can be ever more valuable. We hack libraries... in the good sense of discovering and delivering more capability and value. More information about the Library Innovation Lab is available at librarylab.law.harvard.edu.

Structured Site Scraper: From Site to Collection

Many local libraries, historical societies, and cultural groups have created web sites displaying collections of digitized photos, scanned documents, oral histories, audio files, and more. Frequently these local treasures sit on sites designed purely with end-user browsing in mind; they would be far more useful if they were more widely searchable and browsable. The team developing the Digital Public Library of America's software platform -- a metadata server -- would like to be able to gather metadata from such sites: discovering the heritage items they point to, capturing as much of the explicit metadata as possible (captions, labels, etc.), and using the structure of the site as a heuristic for parsing the structure of the collection. This metadata would then be assimilated into the appropriate schema and imported into the DPLA's meta-catalog. The local curators would first be shown the parsed data so they can make corrections to its content and structure. In addition, a site map would be generated for the local curators.
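As a rough illustration only, here is a minimal sketch (in Python, using the requests and BeautifulSoup libraries) of how such a scraper might walk a single collection page, pull captions and labels from around each item link, and use nearby headings as a hint about where the item sits in the collection. The starting URL, the output record fields, and the heading heuristic are all illustrative assumptions, not part of the DPLA platform.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

START_URL = "http://example.org/collection/"  # hypothetical local collection site

def scrape_collection(start_url):
    """Fetch one collection page and build a simple record per linked item."""
    html = requests.get(start_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for link in soup.find_all("a", href=True):
        # Explicit metadata: the link text and any image alt text / caption.
        title = link.get_text(strip=True)
        img = link.find("img")
        caption = img.get("alt", "").strip() if img else ""
        # Structural heuristic: the nearest preceding heading suggests which
        # part of the collection this item belongs to.
        section = link.find_previous(["h1", "h2", "h3"])
        records.append({
            "title": title or caption,
            "description": caption,
            "isPartOf": section.get_text(strip=True) if section else "",
            "identifier": urljoin(start_url, link["href"]),
        })
    return records

if __name__ == "__main__":
    for record in scrape_collection(START_URL):
        print(record)

A real harvester would of course need per-site configuration, politeness controls, and the review step described above, in which local curators correct the parsed records before anything is imported into the meta-catalog.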

Long Tail Browser

The Harvard Library system has a rich set of metadata available about the 12 million books and other items in its collection. This includes "event" data such as circulation records broken down by school and borrower type, which works have been called back from loans early, items on reserve, items ordered by the Harvard Coop, and more. It isn't hard to come up with useful ranking algorithms that employ this data. It is more challenging to devise interestingness algorithms. And it is more challenging still to devise algorithms that will find interesting and relevant works in the long tail. We are therefore proposing a project that explores using every scrap of metadata to provide search results based on interestingness, and that discerns interestingness in items in the long tail.
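As a purely illustrative sketch, a first cut at such a score might look like the following (Python). Each item carries simple event counts drawn from the kinds of data described above; the field names, weights, and long-tail boost are invented for the example and are not the Lab's actual algorithm.

import math

# Hypothetical weights: rarer, stronger signals (recalls, reserves) count for
# more than routine checkouts. These numbers are illustrative only.
WEIGHTS = {"checkouts": 1.0, "coop_orders": 2.0, "reserves": 3.0, "recalls": 4.0}

def interestingness(item, total_items):
    """Score one item from its event counts, nudging long-tail items upward."""
    # Base score: weighted, log-damped event counts, so a few blockbuster
    # titles cannot dominate the ranking outright.
    base = sum(w * math.log1p(item.get(field, 0)) for field, w in WEIGHTS.items())
    # Long-tail boost: the fewer events an item has overall, the larger this
    # factor, so lightly used but non-zero items rise relative to raw counts.
    events = sum(item.get(field, 0) for field in WEIGHTS)
    rarity = math.log((total_items + 1) / (events + 1))
    return base * (1 + 0.1 * rarity)

if __name__ == "__main__":
    catalog = [
        {"title": "Heavily used textbook", "checkouts": 500, "reserves": 20},
        {"title": "Obscure local history", "checkouts": 3, "recalls": 1},
    ]
    for item in sorted(catalog, key=lambda i: interestingness(i, 12_000_000), reverse=True):
        print(item["title"], round(interestingness(item, 12_000_000), 2))

Whether a score along these lines actually surfaces interesting and relevant long-tail works is exactly the open question the project would explore.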