Web Crawler

Revision as of 17:20, 18 March 2010

MediaCloud, a Berkman Center project, and StopBadware (http://stopbadware.org/), a former Berkman Center project that has spun off as an independent organization, have each built systems to crawl websites and save the results into a database. We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler.

This web crawler would have to allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters that should be applied (e.g., use of a particular referrer, or limitation to particular file extensions), the identifier string to be used for requests, and other parameters. It would also return results in a way that allows flexibility in how the results are stored (e.g., MySQL, PostgreSQL, flat files). The crawler would also need some way to enable user-defined processing of downloaded files, for example through callbacks to user scripts.

The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture. The MediaCloud source code, which is available on SourceForge, might serve as a useful starting point.
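
As a rough illustration of the kind of interface such a crawler might expose, the sketch below shows how the user-specified parameters described above (seed URLs, crawl depth, an extension filter, the request identifier string, and a processing callback) could be wired into a multithreaded, queue-driven crawl loop. All names here, such as CrawlerConfig and crawl, are hypothetical and are not taken from the MediaCloud or StopBadware code bases.

```python
# A minimal sketch of a configurable, multithreaded crawler of the kind
# described above.  All names (CrawlerConfig, crawl, on_fetch, ...) are
# illustrative assumptions, not part of the MediaCloud or StopBadware code.
import queue
import threading
import urllib.parse
import urllib.request
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CrawlerConfig:
    seed_urls: Sequence[str]                        # URLs to be crawled
    max_depth: int = 2                              # depth of recursive crawling
    allowed_extensions: tuple = ()                  # e.g. (".html", ".htm"); empty = no filter
    user_agent: str = "example-crawler/0.1"         # identifier string used for requests
    on_fetch: Callable[[str, bytes], None] = print  # user-defined processing hook


def _allowed(url: str, config: CrawlerConfig) -> bool:
    """Apply the optional file-extension filter to a URL's path."""
    if not config.allowed_extensions:
        return True
    return urllib.parse.urlparse(url).path.endswith(config.allowed_extensions)


def _fetch(url: str, config: CrawlerConfig) -> bytes:
    """Download one URL, sending the configured identifier string."""
    request = urllib.request.Request(url, headers={"User-Agent": config.user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


def crawl(config: CrawlerConfig, num_workers: int = 4) -> None:
    """Run a multithreaded crawl driven by a shared queue of (url, depth) items."""
    work: "queue.Queue[tuple[str, int]]" = queue.Queue()
    for url in config.seed_urls:
        work.put((url, 0))

    def worker() -> None:
        while True:
            try:
                url, depth = work.get(timeout=5)
            except queue.Empty:
                return                             # no more work: let the thread exit
            try:
                if _allowed(url, config):
                    body = _fetch(url, config)
                    config.on_fetch(url, body)     # hand the page to the user's callback
                    # Link extraction and re-queueing of discovered URLs up to
                    # config.max_depth would go here; omitted to keep the sketch short.
            except Exception:
                pass                               # a real crawler would log and retry
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    work.join()                                    # block until every queued URL is handled


# Example use: append every fetched page to a flat file; the same callback
# interface could instead write rows to MySQL or PostgreSQL.
if __name__ == "__main__":
    def save_to_file(url: str, body: bytes) -> None:
        with open("crawl-output.dat", "ab") as out:
            out.write(body)

    crawl(CrawlerConfig(seed_urls=["https://example.org/"], on_fetch=save_to_file))
```

The user-supplied on_fetch callback is what keeps storage and post-processing pluggable, and in a distributed deployment the in-process queue could be replaced by a shared, networked queue feeding crawler processes on multiple machines.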