Web Crawler - Revision history

Geeks at 18:27, 29 March 2010

2010-03-29T18:27:07Z

← Older revision		Revision as of 18:27, 29 March 2010
Line 7:		Line 7:
	architecture. The media cloud source code that is available on source forge might		architecture. The media cloud source code that is available on source forge might
	serve as a useful starting point.		serve as a useful starting point.

			Note: We've received a number of questions about using Python. We would strongly prefer a language other than Python. While we agree that Python is a good language for writing a web
			crawler, we unfortunately don't have in-house expertise in Python.
			We're very concerned that this would make it difficult for us to maintain
			the code base after the summer and that we would be unable to
			mentor a Python project effectively.

Geeks at 21:20, 18 March 2010

2010-03-18T21:20:12Z

← Older revision		Revision as of 21:20, 18 March 2010
Line 1:		Line 1:
	[[MediaCloud]], a Berkman Center project, and [http://stopbadware.org/ StopBadware], a former Berkman Center project that has spun off as an independent organization, have each built systems to crawl websites and save the results into a database. We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler. This web crawler would have to allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters that should be applied (e.g., use of a particular referrer, or limitation to particular file extensions), the identifier string to be used for requests, and other parameters. It would also return results in a way that allows flexibility in how the results will be stored (e.g., mysql, postgres, flat files). The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture. The crawler would also need some way to enable user defined		[[MediaCloud]], a Berkman Center project, and [http://stopbadware.org/ StopBadware], a former Berkman Center project that has spun off as an independent organization, have each built systems to crawl websites and save the results into a database. We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler. This web crawler would have to allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters that should be applied (e.g., use of a particular referrer, or limitation to particular file extensions), the identifier string to be used for requests, and other parameters. 
It would also return results in a way that allows flexibility in how the results will be stored (e.g., mysql, postgres, flat files). The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture. The crawler would also need some way to enable user defined
	processing of ~~download~~ files for example by using call backs to user scripts.		processing of downloaded files for example by using call backs to user scripts.
	The crawler should be multithreaded and ideally would allow crawler processes to		The crawler should be multithreaded and ideally would allow crawler processes to
	run on multiple machines using a centralized queue. It is also important that		run on multiple machines using a centralized queue. It is also important that
	the crawler be implemented in a highly reusable fashion, with straightforward		the crawler be implemented in a highly reusable fashion, with straightforward
	installation and limited dependence on external libraries or system		installation and limited dependence on external libraries or system
	architecture. The media cloud source code is available on source forge might		architecture. The media cloud source code that is available on source forge might
	serve as a useful starting point.		serve as a useful starting point.

Geeks: Add sentence describing specific mediacloud needs

2010-03-18T21:07:22Z

← Older revision		Revision as of 21:07, 18 March 2010
Line 1:		Line 1:
	[[MediaCloud]], a Berkman Center project, and [http://stopbadware.org/ StopBadware], a former Berkman Center project that has spun off as an independent organization, have each built systems to crawl websites and save the results into a database. We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler. This web crawler would have to allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters that should be applied (e.g., use of a particular referrer, or limitation to particular file extensions), the identifier string to be used for requests, and other parameters. It would also return results in a way that allows flexibility in how the results will be stored (e.g., mysql, postgres, flat files). The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture.		[[MediaCloud]], a Berkman Center project, and [http://stopbadware.org/ StopBadware], a former Berkman Center project that has spun off as an independent organization, have each built systems to crawl websites and save the results into a database. We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler. This web crawler would have to allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters that should be applied (e.g., use of a particular referrer, or limitation to particular file extensions), the identifier string to be used for requests, and other parameters. It would also return results in a way that allows flexibility in how the results will be stored (e.g., mysql, postgres, flat files). 
The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture. The crawler would also need some way to enable user defined
			processing of download files for example by using call backs to user scripts.
			The crawler should be multithreaded and ideally would allow crawler processes to
			run on multiple machines using a centralized queue. It is also important that
			the crawler be implemented in a highly reusable fashion, with straightforward
			installation and limited dependence on external libraries or system
			architecture. The media cloud source code is available on source forge might
			serve as a useful starting point.

Geeks: Add stop badware link

2010-03-18T21:04:34Z

← Older revision		Revision as of 21:04, 18 March 2010
Line 1:		Line 1:
	[[MediaCloud]], a Berkman Center project, and StopBadware, a former Berkman Center project that has spun off as an independent organization, have each built systems to crawl websites and save the results into a database. We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler. This web crawler would have to allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters that should be applied (e.g., use of a particular referrer, or limitation to particular file extensions), the identifier string to be used for requests, and other parameters. It would also return results in a way that allows flexibility in how the results will be stored (e.g., mysql, postgres, flat files). The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture.		[[MediaCloud]], a Berkman Center project, and [http://stopbadware.org/ StopBadware], a former Berkman Center project that has spun off as an independent organization, have each built systems to crawl websites and save the results into a database. We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler. This web crawler would have to allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters that should be applied (e.g., use of a particular referrer, or limitation to particular file extensions), the identifier string to be used for requests, and other parameters. It would also return results in a way that allows flexibility in how the results will be stored (e.g., mysql, postgres, flat files). 
The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture.

Geeks at 21:00, 18 March 2010

2010-03-18T21:00:00Z

← Older revision		Revision as of 21:00, 18 March 2010
Line 1:		Line 1:
	MediaCloud, a Berkman Center project, and StopBadware, a former Berkman Center project that has spun off as an independent organization, have each built systems to crawl websites and save the results into a database. We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler. This web crawler would have to allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters that should be applied (e.g., use of a particular referrer, or limitation to particular file extensions), the identifier string to be used for requests, and other parameters. It would also return results in a way that allows flexibility in how the results will be stored (e.g., mysql, postgres, flat files). The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture.		[[MediaCloud]], a Berkman Center project, and StopBadware, a former Berkman Center project that has spun off as an independent organization, have each built systems to crawl websites and save the results into a database. We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler. This web crawler would have to allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters that should be applied (e.g., use of a particular referrer, or limitation to particular file extensions), the identifier string to be used for requests, and other parameters. It would also return results in a way that allows flexibility in how the results will be stored (e.g., mysql, postgres, flat files). 
The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture.

Geeks: Readd intro sentence

2010-03-18T20:57:23Z

← Older revision		Revision as of 20:57, 18 March 2010
Line 1:		Line 1:
	We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler. This web crawler would have to allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters that should be applied (e.g., use of a particular referrer, or limitation to particular file extensions), the identifier string to be used for requests, and other parameters. It would also return results in a way that allows flexibility in how the results will be stored (e.g., mysql, postgres, flat files). The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture.		MediaCloud, a Berkman Center project, and StopBadware, a former Berkman Center project that has spun off as an independent organization, have each built systems to crawl websites and save the results into a database. We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler. This web crawler would have to allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters that should be applied (e.g., use of a particular referrer, or limitation to particular file extensions), the identifier string to be used for requests, and other parameters. It would also return results in a way that allows flexibility in how the results will be stored (e.g., mysql, postgres, flat files). The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture.

WikiSysop: New page: We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler. This web crawler would have to allow the user to speci...

2010-02-26T20:51:51Z

New page

We would like someone to build upon the work done by these two projects to create a flexible, programmable, and scalable web crawler. This web crawler would have to allow the user to specify the URLs to be crawled, the depth of recursive crawling, any filters that should be applied (e.g., use of a particular referrer, or limitation to particular file extensions), the identifier string to be used for requests, and other parameters. It would also return results in a way that allows flexibility in how the results will be stored (e.g., mysql, postgres, flat files). The crawler should be multithreaded and ideally would allow crawler processes to run on multiple machines using a centralized queue. It is also important that the crawler be implemented in a highly reusable fashion, with straightforward installation and limited dependence on external libraries or system architecture.
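
The parameters listed above (seed URLs, recursion depth, extension filters, identifier string, referrer, and pluggable result storage) can be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed design: the names CrawlConfig, extension_allowed, crawl, fetch, and store are hypothetical, the page fetcher is passed in as a function rather than implemented, and the loop is single-threaded, so the multithreading and centralized-queue requirements are not covered. Python is used here only for brevity; it is not a suggestion about the implementation language.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlparse
import posixpath


@dataclass
class CrawlConfig:
    """Hypothetical container for the user-specified crawl parameters."""
    seeds: List[str]                           # URLs to start crawling from
    max_depth: int = 1                         # depth of recursive crawling
    allowed_extensions: Tuple[str, ...] = ()   # empty tuple means no restriction
    user_agent: str = "example-crawler/0.1"    # identifier string for requests
    referrer: str = ""                         # optional referrer to send


def extension_allowed(url: str, allowed: Tuple[str, ...]) -> bool:
    """Return True if the URL path ends in one of the allowed extensions."""
    if not allowed:
        return True
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return ext in allowed


def crawl(
    config: CrawlConfig,
    fetch: Callable[[str, Dict[str, str]], Tuple[str, List[str]]],
    store: Callable[[str, int, str], None],
) -> List[str]:
    """Breadth-first crawl up to config.max_depth.

    fetch(url, headers) is assumed to return (body, links_found_in_body).
    store(url, depth, body) is the storage hook: it could write to MySQL,
    Postgres, or flat files, or call a user script.
    """
    headers = {"User-Agent": config.user_agent}
    if config.referrer:
        headers["Referer"] = config.referrer  # HTTP spells this header "Referer"

    seeds = list(dict.fromkeys(config.seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    visited: List[str] = []

    while queue:
        url, depth = queue.popleft()
        body, links = fetch(url, headers)
        store(url, depth, body)
        visited.append(url)
        if depth >= config.max_depth:
            continue  # do not follow links beyond the requested depth
        for link in links:
            if link not in seen and extension_allowed(link, config.allowed_extensions):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Passing fetch and store in as functions keeps the crawl loop independent of any particular HTTP library or database, which is one way to read the requirement for limited dependence on external libraries. A distributed version would presumably replace the in-memory deque and seen set with a shared queue service.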