Web Crawler

From Berkman Klein Google Summer of Code Wiki
Jump to navigation Jump to search

MediaCloud, a Berkman Center project, and StopBadware, a former Berkman Center project that has spun off as an independent organization, have each built systems to crawl websites and save the results into a database. We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler. This web crawler would have to allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters that should be applied (e.g., use of a particular referrer, or limitation to particular file extensions), the identifier string to be used for requests, and other parameters. It would also return results in a way that allows flexibility in how the results will be stored (e.g., mysql, postgres, flat files). The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture. The crawler would also need some way to enable user defined processing of downloaded files for example by using call backs to user scripts. The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture. The media cloud source code that is available on source forge might serve as a useful starting point.

Note: We've gotten a bunch of questions about using Python. Our strong preference would be for a language other than Python. While we agree that Python is a good language for writing a web crawler, unfortunately we don't have in house expertise in Python. We're very concerned that this would make it difficult for us to maintain the code base after the summer and that we would be unable to effectively mentor a Python project.