官术网_书友最值得收藏!

Final version

The full source code for this advanced link crawler can be downloaded at https://github.com/kjam/wswp/blob/master/code/chp1/advanced_link_crawler.py. Each of the sections in this chapter has matching code in the repository at https://github.com/kjam/wswp. To easily follow along, feel free to fork the repository and use it to compare and test your own code.

To test the link crawler, let's try setting the user agent to BadCrawler, which, as we saw earlier in this chapter, was blocked by robots.txt. As expected, the crawl is blocked and finishes immediately:

    >>> start_url = 'http://example.webscraping.com/index'
>>> link_regex = '/(index|view)'
>>> link_crawler(start_url, link_regex, user_agent='BadCrawler')
Blocked by robots.txt: http://example.webscraping.com/

Now, let's try using the default user agent and setting the maximum depth to 1 so that only the links from the home page are downloaded:

    >>> link_crawler(start_url, link_regex, max_depth=1)
Downloading: http://example.webscraping.com//index
Downloading: http://example.webscraping.com/index/1
Downloading: http://example.webscraping.com/view/Antigua-and-Barbuda-10
Downloading: http://example.webscraping.com/view/Antarctica-9
Downloading: http://example.webscraping.com/view/Anguilla-8
Downloading: http://example.webscraping.com/view/Angola-7
Downloading: http://example.webscraping.com/view/Andorra-6
Downloading: http://example.webscraping.com/view/American-Samoa-5
Downloading: http://example.webscraping.com/view/Algeria-4
Downloading: http://example.webscraping.com/view/Albania-3
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Afghanistan-1

As expected, the crawl stopped after downloading the first page of countries.

主站蜘蛛池模板: 云梦县| 台南县| 大同县| 新余市| 定州市| 泽普县| 北票市| 津市市| 柳河县| 株洲市| 西乌珠穆沁旗| 长顺县| 利辛县| 尼勒克县| 开原市| 陇西县| 宝鸡市| 包头市| 临泉县| 左贡县| 桃江县| 武威市| 濮阳市| 东阿县| 进贤县| 东乌珠穆沁旗| 玉山县| 台北县| 彰武县| 罗源县| 汾阳市| 红原县| 南汇区| 兴义市| 西畴县| 沈丘县| 颍上县| 贵南县| 贡嘎县| 白沙| 兴义市|