
Throttling downloads

If we crawl a website too quickly, we risk being blocked or overloading the server(s). To minimize these risks, we can throttle our crawl by waiting for a set delay between downloads. Here is a class to implement this:

from urllib.parse import urlparse
import time


class Throttle:
    """Add a delay between downloads to the same domain."""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently,
                # so we need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()
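
A quick way to check the behavior is to call wait twice in a row for the same domain; the second call should block for roughly the full delay. The example.com URL and one-second delay below are illustrative choices, not values from the crawler itself:

import time

throttle = Throttle(delay=1)
throttle.wait('http://example.com')    # first access to this domain: no sleep
start = time.time()
throttle.wait('http://example.com')    # same domain again within the delay: sleeps
print(round(time.time() - start, 1))   # roughly 1.0 seconds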

The Throttle class keeps track of when each domain was last accessed and sleeps if the time since the last access is shorter than the specified delay. We can add throttling to the crawler by calling the throttle's wait method before every download:

throttle = Throttle(delay)
...
throttle.wait(url)
html = download(url, user_agent=user_agent, num_retries=num_retries,
                proxy=proxy, charset=charset)
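
To make the wiring concrete, here is a minimal sketch of a crawl loop built around Throttle. The urls list, the one-second delay, and the simple urllib-based stand-in for the download function are illustrative assumptions, not the full download function developed earlier:

import urllib.request

def simple_download(url):
    # illustrative stand-in for the chapter's download function
    with urllib.request.urlopen(url) as response:
        return response.read()

# example URLs and delay chosen purely for illustration
urls = ['http://example.com/index1', 'http://example.com/index2']
throttle = Throttle(delay=1)

for url in urls:
    throttle.wait(url)           # pauses only if this domain was hit less than 1s ago
    html = simple_download(url)

Because the last access time is tracked per domain, requests to different domains do not hold each other up; only repeated requests to the same domain are delayed.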