
Estimating the size of a website

The size of the target website will affect how we crawl it. If the website is just a few hundred URLs, such as our example website, efficiency is not important. However, if the website has over a million web pages, downloading each page sequentially would take months. This problem is addressed later in Chapter 4, Concurrent Downloading, which covers distributed downloading.
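To make the scale concrete, here is a rough back-of-the-envelope sketch. The function name `sequential_crawl_days` and the figure of 5 seconds per page (download time plus a polite delay between requests) are assumptions chosen for illustration, not measurements of any real website.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def sequential_crawl_days(num_pages, seconds_per_page):
    """Return the number of days needed to download pages one at a time."""
    return num_pages * seconds_per_page / SECONDS_PER_DAY


# Assumed cost per page: 5 seconds (hypothetical, includes a delay between requests)
for num_pages in (300, 1_000_000):
    days = sequential_crawl_days(num_pages, 5.0)
    print(f"{num_pages} pages: about {days:.2f} days")
```

Under this assumption, a few hundred pages finish in under half an hour, while a million pages take roughly two months, and longer still with a slower server or a larger delay.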

A quick way to estimate the size of a website is to check the results of Google's crawler, which has quite likely already crawled the website we are interested in. We can access this information through a Google search that uses the site keyword to restrict the results to our domain. An interface to this and other advanced search parameters is available at http://www.google.com/advanced_search.

Here are the site search results for our example website when searching Google for site:example.webscraping.com:

As we can see, Google currently estimates more than 200 web pages (this result may vary), which is close to the actual size of the website. For larger websites, Google's estimates may be less accurate.

We can filter these results to certain parts of the website by adding a URL path to the domain. Here are the results for site:example.webscraping.com/view, which restricts the site search to the country web pages:

Again, your results may vary in size; however, this additional filter is useful because ideally you only want to crawl the part of a website containing useful data rather than every page.
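The two site searches above can also be built programmatically, which is handy when checking several domains or paths. The following is a minimal sketch using only the Python standard library; the helper name `site_search_url` is hypothetical, and the code only constructs a URL to open in a browser rather than downloading or parsing Google's results.

```python
from urllib.parse import urlencode


def site_search_url(domain, path=""):
    """Build a Google search URL restricted to a domain and an optional URL path."""
    query = "site:" + domain + path
    # urlencode percent-encodes the ":" and "/" characters in the query
    return "https://www.google.com/search?" + urlencode({"q": query})


# Whole example website
print(site_search_url("example.webscraping.com"))
# Only the country web pages under /view
print(site_search_url("example.webscraping.com", "/view"))
```

Opening either URL in a browser shows the same results as typing the site query into the search box.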
