The Open Library

Name: The Open Library
Start: 2007-10-23T12:30:00-0400
End: 2007-10-23T13:30:00-0400
Location: Harvard Berkman Klein Center Event

Aaron Swartz, The Open Library

Aaron Swartz, co-founder of Reddit.com, spoke at the luncheon series about the Open Library.

Thanks to new technology, the grand vision of a library containing every book in the world is now within our grasp. The Open Library Project, a loose collection of technologists, publishers, librarians, and book-lovers, has taken up this challenge by trying to create a website collecting everything we know about books — including library records, publishers’ blurbs, full-text and scans, reviews, and more. Learn about the vision, the technology, the progress, and how you can join us.

About Aaron

Aaron Swartz is the Tech Lead for Open Library. He was previously a co-founder of Reddit.com, which was purchased by Condé Nast in late 2006. He has worked on Internet specifications for RSS and RDF and was one of the early team members of the Creative Commons project. He is the author of a number of free software packages and a co-founder of Jottit.com.

Links

+ Open Library demo

+ Open Library vision

+ Aaron Swartz’s website

Download original audio or video from this event.

Subscribe to the Berkman Klein events series podcast.

Q + A

Berkman Intern Yvette Wohn sat down with Aaron to talk about his current project and the metadata of scanned books, the differences between the Open Library and Google Book Search, and how he finds inspiration for creativity.

Also, be sure to check out this article in The New York Times on the growing reluctance of major research libraries to accept Google's restrictive terms for book scanning. Many are instead signing on to have their materials made available through Brewster Kahle's Open Content Alliance, which runs the Open Library project.

Q. You are working on creating the Open Library, an online database that contains all the world's books for anyone to view. What kind of significance does this project have in our age?

I think this is one of the grand challenges for the Internet era. The vast majority of our cultural legacy is in books -- books are where you go when you have a great idea you want to share, a great story you want to publish, a classic piece of art -- it's hard to overstate their importance. But in an era where few people venture beyond a search engine when doing research, making these resources more available for Internet users is a crucial challenge.

Q. Will there be some kind of Dewey Decimal system for this library? Will that catalog/index system support foreign languages?

In a traditional library, a book can only be at one place on the shelf, so picking a cataloging system like Dewey Decimal is necessary. But online, you can support as many category systems as you like. So we'll have the Dewey Decimal System, the Library of Congress system, various subjects, tags, searches -- anything people want to use, we'll try to make available. And of course we want to support as many foreign countries and languages as possible.

Q. Full-text requires a lot of manual labor. Do you see a possibility of converting the scanned images into text automatically?

Right now what we do is we have humans turning pages before cameras to scan the book, then we use software to convert these scans into text. It isn't perfect, but it's very good. Here's a random sentence from a random book: "To punish himself for the too great nicety which he had formerly had in the care of his hands and feet, he now resolved to neglect them." Not bad -- I don't see any errors, actually. There's also a site, Distributed Proofreaders, which verifies these scans by hand and then publishes the results as plain-text books.

Q. Will text contain embedded hyperlinks? If so, who will manually input this data?

We hope to eventually allow people to highlight and annotate portions of a book, adding additional comments or linking off to additional information.

Q. Do you think publishers are threatened by the Open Library concept the way record labels are by free music file sharing?

Not at all; the publishers have been extremely helpful. We're only scanning books that are out-of-copyright, which is a small minority of what's being published. For the other books, we're working with publishers and libraries to make them available in every way we know how: through purchasing online or in a local bookstore, picking up a copy in a library or through digital interlibrary loan, etc. We're helping publishers get the word out about their books.

Q. At a June conference we hosted at Harvard Law School, a group on universities and libraries explored the changing roles of libraries. How does the Open Library fit into the context of these changes? Do we need to build physical libraries in the future?

There will always be a need for housing physical books and assisting people with research. But I think libraries can seize upon new technologies to increase their relevance. When visiting Google is so much easier than visiting the library, they need to enhance their online presences to make them more attractive to Internet surfers. That's part of what we're doing at Open Library: we're trying to give books a place in Google that can then be used to point users toward local libraries.

Q. How is the Open Library project different from Google's project to make an online book archive?

Google is scanning a lot of books but they're not doing much more than just putting them online. We want to do much more than that and become the hub for information about books. We let users come to our site and add more information to their favorite books, find the books in various places, browse and sort by categories, and so on.

Q. As with Wikipedia, is there any possibility of some people with malicious intent altering content? Is there some kind of gatekeeping or check system regarding content? (Since you are a hacker yourself, you must realize the chances of information hacking.)

Sure, there's always the possibility of malicious people. Our plan is to watch things carefully and add safeguards as they become necessary.

Q. How do you decide which texts to digitize first, and how are you dealing with copyright issues?

Right now the digitization done by the Internet Archive is funded by a variety of foundations and companies, and it's those organizations that decide which books to scan. Only out-of-copyright books are scanned.

Q. How much time are you spending on the Open Library project relative to the other programming projects that you do? Could you name another project (or give us a link) that you are currently working on -- preferably a private one?

Open Library takes up the majority of my time, but I'm also working with my friend Simon Carstensen on a new startup, http://jottit.com/ -- a really easy way to make a website.

Q. What do you do when you're jammed? How do you find inspiration for creativity?

It's a bit embarrassing to say this, but when I find myself a bit burnt-out I like to spend a day just watching some good TV shows. It's a bit of a guilty pleasure, but I always feel much better after doing it.

Q. This year is Berkman's 10th anniversary. How do you think the project you're working on will evolve in 10 years? What do you think the Internet will look like in 10 years with respect to libraries, digital information, and scholarship?

Ten years is a long time for a website. I'd like to think that Open Library will be one of the cornerstones of Internet culture, a recognized site like Wikipedia or Craigslist, a place people go to do research and find interesting things.

Past Event Tuesday, October 23, 2007

Time
12:30 PM - 1:30 PM