- Python Web Scraping (Second Edition)
- Katharine Jarmul, Richard Lawson
- 2021-07-09 19:42:46
Final version
The full source code for this advanced link crawler can be downloaded at https://github.com/kjam/wswp/blob/master/code/chp1/advanced_link_crawler.py. Each of the sections in this chapter has matching code in the repository at https://github.com/kjam/wswp. To easily follow along, feel free to fork the repository and use it to compare and test your own code.
To test the link crawler, let's try setting the user agent to BadCrawler, which, as we saw earlier in this chapter, was blocked by robots.txt. As expected, the crawl is blocked and finishes immediately:
>>> start_url = 'http://example.webscraping.com/index'
>>> link_regex = '/(index|view)'
>>> link_crawler(start_url, link_regex, user_agent='BadCrawler')
Blocked by robots.txt: http://example.webscraping.com/
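The robots.txt check behind this result can be sketched with the standard library's `urllib.robotparser`. This is a minimal illustration, not the repository's code: the `ROBOTS_TXT` rules, the `is_allowed` helper, and the `wswp` agent name are assumptions made for the example, and the rules are parsed from a string so that no network access is needed.

```python
from urllib import robotparser

# Assumed rules for illustration only; the real file lives on the site.
ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    rp = robotparser.RobotFileParser()
    # parse() takes an iterable of lines, so no HTTP request is made.
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)


url = 'http://example.webscraping.com/index'
print(is_allowed('BadCrawler', url))  # the BadCrawler section disallows everything
print(is_allowed('wswp', url))        # falls through to the '*' section
```

A crawler built this way would call such a check before each download and skip any URL for which it returns `False`.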
Now, let's try using the default user agent and setting the maximum depth to 1 so that only the links from the home page are downloaded:
>>> link_crawler(start_url, link_regex, max_depth=1)
Downloading: http://example.webscraping.com//index
Downloading: http://example.webscraping.com/index/1
Downloading: http://example.webscraping.com/view/Antigua-and-Barbuda-10
Downloading: http://example.webscraping.com/view/Antarctica-9
Downloading: http://example.webscraping.com/view/Anguilla-8
Downloading: http://example.webscraping.com/view/Angola-7
Downloading: http://example.webscraping.com/view/Andorra-6
Downloading: http://example.webscraping.com/view/American-Samoa-5
Downloading: http://example.webscraping.com/view/Algeria-4
Downloading: http://example.webscraping.com/view/Albania-3
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Afghanistan-1
As expected, the crawl stopped after downloading the first page of countries.
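One way to picture the depth limit is a crawl loop that records the depth at which each URL was found and stops following links once that depth reaches `max_depth`. The sketch below is an assumption-laden illustration rather than the book's implementation: `crawl`, `toy_links`, and the in-memory `SITE` dictionary are hypothetical stand-ins, and no pages are actually downloaded.

```python
def crawl(start_url, get_links, max_depth=1):
    """Return the URLs in the order a depth-limited crawl would visit them."""
    seen = {start_url: 0}  # URL -> depth at which it was discovered
    queue = [start_url]
    visited = []
    while queue:
        url = queue.pop()
        visited.append(url)  # a real crawler would download the page here
        depth = seen[url]
        if depth >= max_depth:
            continue  # page is visited, but its links are not followed
        for link in get_links(url):
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return visited


# Hypothetical link graph standing in for a real site.
SITE = {
    '/index': ['/view/1', '/view/2', '/index/1'],
    '/index/1': ['/view/3', '/index/2'],
    '/view/1': ['/edit/1'],
}


def toy_links(url):
    """Return the outgoing links of url in the toy site."""
    return SITE.get(url, [])


for page in crawl('/index', toy_links, max_depth=1):
    print('Visiting:', page)
```

With `max_depth=1`, the start page and the pages it links to are visited, but links found on those pages (such as `/view/3` here) are never queued, which mirrors the behaviour in the output above.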