How to make a threaded scraper in Python

DaveN has a good writeup of how he makes threaded scrapers - sorry - "Data collection applications" in Python. Although all the examples are written in Python, the design and rationale apply to just about every modern-day programming language. The final example is pretty much a full framework for high-speed web scraping. Cheers, Dave.
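For anyone who wants the general shape before reading the full writeup, here is a minimal sketch of the usual worker-pool pattern for a threaded scraper. It is not DaveN's code; the names `fetch_url` and `scrape`, the worker count, and the timeout are all assumptions made for illustration, and it uses only the standard library.

```python
import queue
import threading
import urllib.request


def fetch_url(url, timeout=10):
    """Download one page and return its body as bytes (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def scrape(urls, fetch=fetch_url, num_workers=4):
    """Fetch every URL with a small pool of worker threads.

    Returns a dict mapping each URL to the fetched result, or to the
    exception that was raised while fetching it.
    """
    tasks = queue.Queue()
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: no more work for this thread
                break
            try:
                outcome = fetch(url)
            except Exception as exc:  # keep one bad page from killing the worker
                outcome = exc
            with results_lock:
                results[url] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()

    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker

    for thread in threads:
        thread.join()
    return results
```

Calling `scrape(["http://example.com/"])` would return a dict with one entry for that URL. The `fetch` argument is there so the download step can be swapped out, for example for a version that sends an identifying User-Agent or for a stub when testing without network access.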

Comments

Thanks for the link jetboy - oops!

Forgot one thing

Looks like he neglected to put in a default identifier of davidnaylor.co.uk with an email address of david @ davidnaylor.co.uk. :)

Thanks for that code though. Very nice to have for those of us who haven't gone through the process before.

O'Reilly has several books on the subject

A good grasp of spidering is extremely valuable in understanding search engine behavior, given that search engines are probably the most active spiders out on the net!

O'Reilly covers this topic quite well in their Spidering Hacks book. Their Google Hacks (including a foreword by the Google Engineering Team) and Yahoo! Hacks contain additional examples specific to these search engines.
