Web Crawler


We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler. The crawler would need to:

* allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters to apply (e.g., use of a particular referrer, or limiting the crawl to particular file extensions), the identifier string sent with requests, and other parameters;
* return results in a way that leaves the storage backend flexible (e.g., MySQL, PostgreSQL, flat files);
* be multithreaded, and ideally allow crawler processes on multiple machines to share a centralized queue;
* be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture.
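
To make the desired interface concrete, here is a minimal single-machine sketch in Python using only the standard library. The class names, parameters, and the pluggable <code>store</code> callable are illustrative assumptions rather than part of either existing project; a distributed version would replace the in-process queue with a shared queue service.

<pre>
# Illustrative sketch only: names and parameters are assumptions, not an
# existing project's API. Standard library only.
import queue
import threading
import urllib.request
from urllib.parse import urljoin, urlparse
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags on a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class Crawler:
    """Multithreaded crawler driven by a central work queue.

    seeds       -- starting URLs
    max_depth   -- recursive crawl depth limit
    url_filter  -- callable deciding whether a URL should be fetched
    user_agent  -- identifier string sent with each request
    store       -- callable receiving (url, body); swap in MySQL, Postgres,
                   or flat-file writers as needed
    """

    def __init__(self, seeds, max_depth=2, url_filter=None,
                 user_agent="example-crawler/0.1", store=print, workers=4):
        self.queue = queue.Queue()
        self.seen = set(seeds)
        self.lock = threading.Lock()
        self.max_depth = max_depth
        self.url_filter = url_filter or (lambda url: True)
        self.user_agent = user_agent
        self.store = store
        self.workers = workers
        for url in seeds:
            self.queue.put((url, 0))

    def worker(self):
        while True:
            try:
                url, depth = self.queue.get(timeout=5)
            except queue.Empty:
                return  # no work left; let the thread exit
            try:
                req = urllib.request.Request(
                    url, headers={"User-Agent": self.user_agent})
                with urllib.request.urlopen(req, timeout=10) as resp:
                    body = resp.read().decode("utf-8", errors="replace")
                self.store(url, body)
                if depth < self.max_depth:
                    parser = LinkExtractor()
                    parser.feed(body)
                    for link in parser.links:
                        absolute = urljoin(url, link)
                        if urlparse(absolute).scheme in ("http", "https") \
                                and self.url_filter(absolute):
                            with self.lock:
                                if absolute not in self.seen:
                                    self.seen.add(absolute)
                                    self.queue.put((absolute, depth + 1))
            except Exception:
                pass  # a real crawler would log, retry, and honor robots.txt
            finally:
                self.queue.task_done()

    def run(self):
        threads = [threading.Thread(target=self.worker)
                   for _ in range(self.workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()


if __name__ == "__main__":
    Crawler(["https://example.org/"], max_depth=1,
            store=lambda url, body: print(url, len(body))).run()
</pre>

Replacing the in-process <code>queue.Queue</code> and <code>seen</code> set with a shared service (for example, a database-backed or message-broker queue) is what would let multiple machines pull from the same centralized frontier, as described above.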