
Using the requests library

Although we have built a fairly advanced parser using only urllib, the majority of scrapers written in Python today utilize the requests library to manage complex HTTP requests. What started as a small library to help wrap urllib features in something "human-readable" is now a very large project with hundreds of contributors. Some of the features available include built-in handling of encoding, important updates to SSL and security, as well as easy handling of POST requests, JSON, cookies, and proxies.
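To give a feel for that convenience, here is a minimal, hand-rolled sketch (not part of the crawler code) exercising a few of these features against the public httpbin.org test service, which simply echoes back what it receives:

import requests

# Fetch a page; httpbin.org is used here purely as a stand-in URL.
resp = requests.get('http://httpbin.org/get')
print(resp.status_code)  # integer status code, e.g. 200
print(resp.json())       # response body parsed as JSON in one call

# POST requests are equally terse: pass a dict and requests takes
# care of form-encoding the body and setting the headers.
resp = requests.post('http://httpbin.org/post', data={'q': 'web scraping'})
print(resp.json()['form'])  # httpbin echoes back the submitted form data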

Throughout most of this book, we will utilize the requests library for its simplicity and ease of use, and because it has become the de facto standard for most web scraping.

To install requests, simply use pip:

pip install requests

For an in-depth overview of all features, you should read the documentation at http://python-requests.org or browse the source code at https://github.com/kennethreitz/requests.

To compare differences using the two libraries, I've also built the advanced link crawler so that it can use requests. You can see the code at https://github.com/kjam/wswp/blob/master/code/chp1/advanced_link_crawler_using_requests.py. The main download function shows the key differences. The requests version is as follows:

import requests

def download(url, user_agent='wswp', num_retries=2, proxies=None):
    """Download a URL and return the page HTML, or None on error."""
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)
        html = resp.text
        if resp.status_code >= 400:
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent, num_retries - 1, proxies)
    except requests.exceptions.RequestException as e:
        print('Download error:', e)
        html = None
    return html

One notable difference is the convenience of having status_code available as an attribute on every response. Additionally, we no longer need to test for character encoding, as the text attribute on our Response object handles that automatically. In the rare case of a non-resolvable URL or a timeout, these errors are all raised as a RequestException, making for a single, simple except clause. Proxy handling is also taken care of simply by passing a dictionary of proxies (for example, {'http': 'http://myproxy.net:1234', 'https': 'https://myproxy.net:1234'}).
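For instance, a call routing the crawl through such a proxy might look like the following (myproxy.net is a placeholder, not a working proxy server):

proxies = {'http': 'http://myproxy.net:1234',
           'https': 'https://myproxy.net:1234'}
html = download('http://example.com', proxies=proxies)
if html is not None:
    print(html[:100])  # print the first 100 characters of the page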

We will continue to compare and use both libraries, so that you are familiar with them depending on your needs and use case. I strongly recommend using requests whenever you are handling more complex websites, or when you need to handle important humanizing methods, such as using cookies or sessions. We will talk more about these methods in Chapter 6, Interacting with Forms.
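As a quick preview, here is a minimal sketch (not from the book's crawler) of a requests Session, which persists cookies across requests automatically; the httpbin.org URLs are illustrative only:

import requests

# A Session carries headers and cookies across requests for us.
session = requests.Session()
session.headers.update({'User-Agent': 'wswp'})

# The first response sets a cookie; the Session stores it and sends it
# back with the second request, with no manual cookie handling needed.
session.get('http://httpbin.org/cookies/set?name=value')
resp = session.get('http://httpbin.org/cookies')
print(resp.json())  # {'cookies': {'name': 'value'}}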
