
Parsing robots.txt

First, we need to interpret robots.txt to avoid downloading blocked URLs. Python's urllib comes with the robotparser module, which makes this straightforward, as follows:

    >>> from urllib import robotparser
    >>> rp = robotparser.RobotFileParser()
    >>> rp.set_url('http://example.webscraping.com/robots.txt')
    >>> rp.read()
    >>> url = 'http://example.webscraping.com'
    >>> user_agent = 'BadCrawler'
    >>> rp.can_fetch(user_agent, url)
    False
    >>> user_agent = 'GoodCrawler'
    >>> rp.can_fetch(user_agent, url)
    True

The robotparser module loads a robots.txt file and then provides a can_fetch() function, which tells you whether a particular user agent is allowed to access a web page or not. Here, when the user agent is set to 'BadCrawler', the robotparser module says that this web page cannot be fetched, as we saw in the definition in the example site's robots.txt.
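For reference, this behavior comes from a per-agent Disallow rule in robots.txt. The example site's actual file may contain additional sections, but the relevant part looks roughly like this:

    # deny the bad crawler access to the entire site
    User-agent: BadCrawler
    Disallow: /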

To integrate robotparser into the link crawler, we first want to create a new function that returns the robotparser object:

    def get_robots_parser(robots_url):
        """ Return the robots parser object using the robots_url """
        rp = robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        return rp

We need to reliably set the robots_url; we can do so by passing an extra keyword argument to our function. We can also set a default value to catch the case where the user does not pass the variable. Assuming the crawl will start at the root of the site, we can simply add robots.txt to the end of the URL. We also need to define the user_agent:

    def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp'):
        ...
        if not robots_url:
            robots_url = '{}/robots.txt'.format(start_url)
        rp = get_robots_parser(robots_url)

Finally, we add the parser check in the crawl loop:

        ...
        while crawl_queue:
            url = crawl_queue.pop()
            # check url passes robots.txt restrictions
            if rp.can_fetch(user_agent, url):
                html = download(url, user_agent=user_agent)
                ...
            else:
                print('Blocked by robots.txt:', url)
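Putting these pieces together, a minimal sketch of the updated link_crawler could look like the following. It assumes the download() helper from earlier in the chapter and the get_robots_parser() function defined above; the regex-based link extraction shown here is only illustrative, and the surrounding details may differ from the chapter's full implementation:

    import re
    from urllib.parse import urljoin

    def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp'):
        """ Crawl from start_url, following links matched by link_regex,
            while honouring the site's robots.txt rules. """
        if not robots_url:
            robots_url = '{}/robots.txt'.format(start_url)
        rp = get_robots_parser(robots_url)
        crawl_queue = [start_url]
        seen = set(crawl_queue)
        while crawl_queue:
            url = crawl_queue.pop()
            # check url passes robots.txt restrictions
            if rp.can_fetch(user_agent, url):
                html = download(url, user_agent=user_agent)  # helper defined earlier
                if not html:
                    continue
                # collect href values and keep those matching the link regex
                for link in re.findall('<a[^>]+href=["\'](.*?)["\']', html):
                    if re.match(link_regex, link):
                        abs_link = urljoin(start_url, link)
                        if abs_link not in seen:
                            seen.add(abs_link)
                            crawl_queue.append(abs_link)
            else:
                print('Blocked by robots.txt:', url)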

We can test our advanced link crawler and its use of robotparser by using the bad user agent string:

    >>> link_crawler('http://example.webscraping.com', '/(index|view)/', user_agent='BadCrawler')
    Blocked by robots.txt: http://example.webscraping.com