
Setting a user agent

By default, urllib will download content with the Python-urllib/3.x user agent, where 3.x is the environment's current version of Python. It would be preferable to use an identifiable user agent in case problems occur with our web crawler. Also, some websites block this default user agent, perhaps after they have experienced a poorly made Python web crawler overloading their server. For example, http://www.meetup.com/ currently returns a 403 Forbidden error when the page is requested with urllib's default user agent.
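You can inspect the default user agent urllib attaches to requests without making any network call; the sketch below uses build_opener, whose addheaders attribute holds the default headers:

```python
import urllib.request

# build_opener exposes the default headers urllib attaches to every request
opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.11')]
```

The exact version suffix depends on your Python installation, which is why a site blocking "Python-urllib" catches all of them at once.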

To download sites reliably, we will need control over setting the user agent. Here is an updated version of our download function with the default user agent set to 'wswp' (which stands for Web Scraping with Python):

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, user_agent='wswp', num_retries=2):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent, num_retries - 1)
    return html

If you now try meetup.com, you will see valid HTML. Our download function can now be reused in later code: it catches errors, retries failed downloads where possible, and sets the user agent.
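You can also confirm that the custom header is attached without hitting the network at all, by inspecting the Request object the way our download function builds it:

```python
import urllib.request

# Build the request exactly as download() does
request = urllib.request.Request('http://www.meetup.com/')
request.add_header('User-agent', 'wswp')

# get_header reads the header back without making a network call
print(request.get_header('User-agent'))  # wswp
```

This is a handy way to sanity-check header handling in tests, where you generally do not want real HTTP traffic.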
