
Sitemap crawler

For our first simple crawler, we will use the sitemap discovered in the example website's robots.txt to download all the web pages. To parse the sitemap, we will use a simple regular expression to extract URLs within the <loc> tags.

We will need to update our code to handle encoding conversions as our current download function simply returns bytes. Note that a more robust parsing approach called CSS selectors will be introduced in the next chapter. Here is our first example crawler:

import re
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

def download(url, user_agent='wswp', num_retries=2, charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            # fall back to the default charset if none is declared
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent, num_retries - 1, charset)
    return html

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...

Now, we can run the sitemap crawler to download all countries from the example website:

    >>> crawl_sitemap('http://example.webscraping.com/sitemap.xml')
Downloading: http://example.webscraping.com/sitemap.xml
Downloading: http://example.webscraping.com/view/Afghanistan-1
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Albania-3
...

As shown in the download function above, we have to decode the response before we can apply regular expressions to it: the read method on the response returns bytes, while the re module expects a string. Our code relies on the website maintainer to declare the correct character encoding in the response headers. If no encoding header is returned, we fall back to UTF-8 and hope for the best. Of course, decoding will fail if the declared encoding is wrong, or if no encoding is declared and the content is not actually UTF-8. There are more sophisticated ways to guess the encoding (see https://pypi.python.org/pypi/chardet), which are fairly easy to add.
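As a minimal sketch of how such a guess could be slotted into the decoding step, the following hypothetical variant (the download_detect name and the errors='replace' fallback are illustrative, not part of the book's code) lets chardet inspect the raw bytes when the server does not declare a charset:

import urllib.request

import chardet  # third-party package: pip install chardet

def download_detect(url, user_agent='wswp', default_charset='utf-8'):
    # hypothetical variant of download() that guesses the encoding
    # with chardet when the server does not declare one
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    resp = urllib.request.urlopen(request)
    raw = resp.read()
    # prefer the charset declared in the Content-Type header
    cs = resp.headers.get_content_charset()
    if not cs:
        # otherwise let chardet guess from the raw bytes
        cs = chardet.detect(raw)['encoding'] or default_charset
    return raw.decode(cs, errors='replace')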

For now, the sitemap crawler works as expected. But, as discussed earlier, sitemap files often cannot be relied on to provide links to every web page. In the next section, we will introduce another simple crawler that does not depend on the sitemap file.

If you don't want to continue the crawl, you can hit Ctrl + C (or Cmd + C) at any time to exit the Python interpreter or program execution.
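If you would rather stop the crawl cleanly without exiting the interpreter, one option (not part of the original code) is to catch KeyboardInterrupt around the crawl loop, reusing the download function defined earlier:

def crawl_sitemap_interruptible(url):
    # hypothetical variant of crawl_sitemap that stops cleanly on Ctrl + C
    sitemap = download(url)
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    try:
        for link in links:
            html = download(link)
            # scrape html here
            # ...
    except KeyboardInterrupt:
        # stop the loop but keep the interpreter running
        print('Crawl interrupted by user')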