
Getting ready

We will use a small website that is included in the www folder of the sample code. To follow along, start a web server from within the www folder. With Python 3, this can be done as follows:

www $ python3 -m http.server 8080
Serving HTTP on 0.0.0.0 port 8080 (http://0.0.0.0:8080/) ...
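
The same server can also be started from Python code, which may be convenient when a script needs the site running before it scrapes. The following is a minimal sketch using only the standard library; serve_directory is a hypothetical helper name, not part of the sample code, and binding to 127.0.0.1 is an assumption made here to keep the server local.

```python
import functools
import http.server
import threading


def serve_directory(directory, port=8080):
    """Serve `directory` over HTTP from a background thread.

    Hypothetical helper: passing port=0 lets the OS pick a free port.
    """
    # Bind the handler to the folder we want to serve (Python 3.7+).
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # A daemon thread does not keep the interpreter alive on exit.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Calling serve_directory("www") from the folder that contains www should make planets.html available at http://localhost:8080/planets.html; call shutdown() and server_close() on the returned server when finished.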

Open a browser page to http://localhost:8080/planets.html. In Chrome, the DOM of a page can be examined by right-clicking the page and selecting Inspect, which opens the Chrome Developer Tools (other browsers have similar tools).



Selecting Inspect on the Page

This opens the developer tools and the inspector. The DOM can be examined in the Elements tab.

The following shows the selection of the first row in the table:

Inspecting the First Row

Each planet's row is a <tr> element. We will examine several characteristics of this element and its neighboring elements, because the page is designed to model common web pages.

Firstly, this element has three attributes: id, planet, and name. Attributes are often important in scraping as they are commonly used to identify and locate data embedded in the HTML.

Secondly, the <tr> element has children: in this case, five <td> elements. We will often need to look into the children of a specific element to find the data we actually want.
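
To make attributes and children concrete, here is a small sketch that reads both from a row. It uses the standard library's xml.etree.ElementTree rather than a scraping library, which only works because the fragment is well-formed XML. The fragment itself is a hypothetical, shortened stand-in for planets.html (two cells per row instead of five, and only the id and name attributes); the real markup and values may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for part of planets.html.
FRAGMENT = """
<table>
  <tbody>
    <tr id="planet1" name="Mercury">
      <td>Mercury</td>
      <td>0.330</td>
    </tr>
    <tr id="planet2" name="Venus">
      <td>Venus</td>
      <td>4.87</td>
    </tr>
  </tbody>
</table>
"""

root = ET.fromstring(FRAGMENT.strip())

# Locate one row by its id attribute.
row = root.find(".//tr[@id='planet1']")

# Attributes are exposed as a plain dictionary.
print(row.attrib)

# The data itself lives in the row's <td> children.
cells = [td.text for td in row.findall("td")]
print(cells)
```

The attribute is used only to locate the row; the values come from the child elements, which is the pattern most scraping code follows.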

This element also has a parent element, <tbody>, and sibling elements: the other <tr> children of that same parent. From any planet, we can go up to the parent and find the other planets. As we will see, tools such as the find family of functions in Beautiful Soup, as well as XPath queries, make it easy to navigate these relationships.
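
The parent and sibling relationships can be sketched with an XPath-style query. This example uses the limited XPath subset in the standard library's xml.etree.ElementTree as a stand-in for the tools named above, and it again relies on a hypothetical, well-formed fragment rather than the real planets.html.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for part of planets.html.
FRAGMENT = """
<table>
  <tbody>
    <tr id="planet1" name="Mercury"><td>Mercury</td></tr>
    <tr id="planet2" name="Venus"><td>Venus</td></tr>
    <tr id="planet3" name="Earth"><td>Earth</td></tr>
  </tbody>
</table>
"""

root = ET.fromstring(FRAGMENT.strip())

# ".." steps from the matched <tr> up to its parent element.
parent = root.find(".//tr[@id='planet1']/..")
print(parent.tag)

# The siblings are the parent's other <tr> children.
siblings = [
    tr.get("name")
    for tr in parent.findall("tr")
    if tr.get("id") != "planet1"
]
print(siblings)
```

ElementTree elements do not keep a reference to their parent, so the step upward is written into the query itself; Beautiful Soup and lxml expose parent and sibling navigation more directly.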
