Web Crawler
Revision as of 16:20, 18 March 2010
MediaCloud, a Berkman Center project, and StopBadware, a former Berkman Center project that has spun off as an independent organization, have each built systems to crawl websites and save the results into a database. We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler.

This web crawler would have to allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters that should be applied (e.g., use of a particular referrer, or limitation to particular file extensions), the identifier string to be used for requests, and other parameters. It would also need to return results in a way that allows flexibility in how they are stored (e.g., MySQL, PostgreSQL, flat files), and to enable user-defined processing of downloaded files, for example through callbacks to user scripts.

The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture. The MediaCloud source code, which is available on SourceForge, might serve as a useful starting point.
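To make the requirements concrete, here is a minimal sketch of the kind of interface described above — seed URLs, recursion depth, a URL filter, a configurable identifier string, a centralized work queue, multiple worker threads, and a user-defined callback for processing downloaded pages. All class and parameter names here are hypothetical illustrations, not the MediaCloud or StopBadware implementation; it uses only the Python standard library, in the spirit of limiting external dependencies.

```python
# Hypothetical sketch of the crawler interface described above.
import re
import threading
import queue
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class Crawler:
    """Minimal multithreaded crawler: seed URLs, recursion depth,
    a URL filter, a configurable User-Agent, and a per-page callback."""

    def __init__(self, seeds, max_depth=1, url_filter=None,
                 user_agent="example-crawler/0.1", on_page=None,
                 fetch=None, workers=4):
        self.frontier = queue.Queue()            # centralized work queue
        for url in seeds:
            self.frontier.put((url, 0))          # (url, depth)
        self.max_depth = max_depth
        self.url_filter = url_filter or (lambda u: True)
        self.user_agent = user_agent
        self.on_page = on_page or (lambda url, body: None)
        self.fetch = fetch or self._fetch_http   # injectable for testing
        self.workers = workers
        self.seen = set(seeds)
        self.lock = threading.Lock()

    def _fetch_http(self, url):
        # Default fetcher; sends the configured identifier string.
        req = Request(url, headers={"User-Agent": self.user_agent})
        with urlopen(req, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def _worker(self):
        while True:
            try:
                # Simplification: a worker exits after 1s of idleness,
                # which can end a run early if fetches are very slow.
                url, depth = self.frontier.get(timeout=1)
            except queue.Empty:
                return
            try:
                body = self.fetch(url)
            except Exception:
                self.frontier.task_done()
                continue
            self.on_page(url, body)              # user-defined processing hook
            if depth < self.max_depth:
                # Naive link extraction; a real crawler would parse HTML.
                for href in re.findall(r'href="([^"]+)"', body):
                    link = urljoin(url, href)
                    with self.lock:
                        if link in self.seen or not self.url_filter(link):
                            continue
                        self.seen.add(link)
                    self.frontier.put((link, depth + 1))
            self.frontier.task_done()

    def run(self):
        threads = [threading.Thread(target=self._worker)
                   for _ in range(self.workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

Because the fetch function is injectable, the same class can be exercised against in-memory pages, and the `on_page` callback is where user scripts (or a pluggable storage backend) would plug in. Distributing work across machines would mean replacing `queue.Queue` with a shared network queue, which this sketch does not attempt.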