
How to do it...

This recipe, like most of the others in this chapter, is presented interactively using IPython, but all of the code for each recipe is also available in a script file. The code for this recipe is in 02/01_parsing_html_wtih_bs.py. You can type the following in, or cut and paste from the script file.

Now let's walk through parsing HTML with Beautiful Soup. We start by loading this page into a BeautifulSoup object. The following code retrieves the content of the page using requests.get, creates a BeautifulSoup object from it, and stores that object in a variable named soup.

In [1]: import requests
...: from bs4 import BeautifulSoup
...: html = requests.get("http://localhost:8080/planets.html").text
...: soup = BeautifulSoup(html, "lxml")
...:

The HTML in the soup object can be retrieved by converting it to a string (most BeautifulSoup objects have this characteristic). The following shows the first 1000 characters of the HTML in the document:

In [2]: str(soup)[:1000]
Out[2]: '<html>\n<head>\n</head>\n<body>\n<div id="planets">\n<h1>Planetary data</h1>\n<div id="content">Here are some interesting facts about the planets in our solar system</div>\n<p></p>\n<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n Name\r\n </th>\n<th>\r\n Mass (10^24kg)\r\n </th>\n<th>\r\n Diameter (km)\r\n </th>\n<th>\r\n How it got its Name\r\n </th>\n<th>\r\n More Info\r\n </th>\n</tr>\n<tr class="planet" id="planet1" name="Mercury">\n<td>\n<img src="img/mercury-150x150.png"/>\n</td>\n<td>\r\n Mercury\r\n </td>\n<td>\r\n 0.330\r\n </td>\n<td>\r\n 4879\r\n </td>\n<td>Named Mercurius by the Romans because it appears to move so swiftly.</td>\n<td>\n<a >Wikipedia</a>\n</td>\n</tr>\n<tr class="p'
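
As a minimal sketch of the same string conversion on smaller objects, the following parses a short inline document instead of the served page. The HTML snippet is a hypothetical, trimmed-down stand-in for planets.html, and html.parser is assumed in place of lxml only so that the example runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the planets page.
html = '<html><body><div id="planets"><h1>Planetary data</h1></div></body></html>'

# html.parser is used here only so the sketch runs without lxml installed.
soup = BeautifulSoup(html, "html.parser")

whole_document = str(soup)          # the whole document as a string
heading_markup = str(soup.h1)       # a single tag serializes the same way
heading_text = str(soup.h1.string)  # the text node inside the tag

print(heading_markup)  # <h1>Planetary data</h1>
print(heading_text)    # Planetary data
```

Converting a tag gives its markup, while converting the text node gives only the text it holds.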

We can navigate the elements in the DOM using properties of soup. soup represents the overall document and we can drill into the document by chaining the tag names. The following navigates to the <table> containing the data:

In [3]: str(soup.html.body.div.table)[:200]
Out[3]: '<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n Name\r\n </th>\n<th>\r\n Mass (10^24kg)\r\n </th>\n<th>\r\n '
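
The chained navigation can be tried without a running web server. The sketch below assumes a small inline document that mimics the structure of the planets page (the snippet itself is hypothetical) and uses html.parser rather than lxml so that it is self-contained.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document with the same nesting as the planets page.
html = """<html><head></head><body>
<div id="planets">
<h1>Planetary data</h1>
<table id="planetsTable">
<tr id="planetHeader"><th>Name</th><th>Mass (10^24kg)</th></tr>
<tr class="planet" id="planet1" name="Mercury"><td>Mercury</td><td>0.330</td></tr>
</table>
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Each attribute access steps down to the first tag with that name.
table = soup.html.body.div.table

print(table.name)   # table
print(table["id"])  # planetsTable
```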

The following retrieves the first child <tr> of the table:

In [6]: soup.html.body.div.table.tr
Out[6]: <tr id="planetHeader">
<th>
</th>
<th>
Name
</th>
<th>
Mass (10^24kg)
</th>
<th>
Diameter (km)
</th>
<th>
How it got its Name
</th>
<th>
More Info
</th>
</tr>

Note that this type of notation retrieves only the first element of that type. Finding more requires iterating over all the children, which we will do next, or using the find methods (the next recipe).
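
The first-match behaviour can be checked on a small example. This is a sketch under assumptions: the inline HTML is a hypothetical stand-in for the planets table, and html.parser is used instead of lxml to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the planets table, with three rows.
html = """<table id="planetsTable">
<tr id="planetHeader"><th>Name</th></tr>
<tr class="planet" id="planet1" name="Mercury"><td>Mercury</td></tr>
<tr class="planet" id="planet2" name="Venus"><td>Venus</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.table

# Attribute-style access returns only the first matching tag ...
first_row = table.tr
print(first_row["id"])  # planetHeader

# ... and is shorthand for calling find() with the tag name.
same_row = table.find("tr")
print(same_row is first_row)  # True
```

The two other rows are not reachable through table.tr alone.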

Each node has both children and descendants. Descendants are all the nodes underneath a given node (even at levels deeper than the immediate children), while children are only the first-level descendants. The following retrieves the children of the table, which is actually a list_iterator object:

In [4]: soup.html.body.div.table.children
Out[4]: <list_iterator at 0x10eb11cc0>
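
To make the difference between children and descendants concrete, the following sketch counts both for a one-row table. The markup is a hypothetical example written without whitespace between tags so that no extra text nodes appear, and html.parser is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table with no whitespace between the tags.
html = "<table><tr><td>Mercury</td><td>0.330</td></tr></table>"

soup = BeautifulSoup(html, "html.parser")
table = soup.table

# Children: only the nodes directly under <table>.
children = list(table.children)

# Descendants: every node below <table>, text nodes included.
descendants = list(table.descendants)

print(len(children))     # 1  (the <tr>)
print(len(descendants))  # 5  (<tr>, <td>, 'Mercury', <td>, '0.330')
```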

We can examine each child element in the iterator using a for loop or a list comprehension. The following uses a list comprehension to get all the children of the table and return the first few characters of their constituent HTML as a list:

In [5]: [str(c)[:45] for c in soup.html.body.div.table.children]
Out[5]:
['\n',
'<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n ',
'\n',
'<tr class="planet" id="planet1" name="Mercury',
'\n',
'<tr class="planet" id="planet2" name="Venus">',
'\n',
'<tr class="planet" id="planet3" name="Earth">',
'\n',
'<tr class="planet" id="planet4" name="Mars">\n',
'\n',
'<tr class="planet" id="planet5" name="Jupiter',
'\n',
'<tr class="planet" id="planet6" name="Saturn"',
'\n',
'<tr class="planet" id="planet7" name="Uranus"',
'\n',
'<tr class="planet" id="planet8" name="Neptune',
'\n',
'<tr class="planet" id="planet9" name="Pluto">',
'\n']
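
The '\n' entries in this output are whitespace text nodes sitting between the rows. One possible way to skip them, shown here as a sketch rather than the recipe's own method, is to keep only the children that are Tag objects. The inline HTML is a hypothetical stand-in for the planets table, and html.parser is assumed instead of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical stand-in for the planets table, one row per line.
html = """<table id="planetsTable">
<tr id="planetHeader"><th>Name</th></tr>
<tr class="planet" id="planet1" name="Mercury"><td>Mercury</td></tr>
<tr class="planet" id="planet2" name="Venus"><td>Venus</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
all_children = list(soup.table.children)

# The newlines between rows are parsed as NavigableString nodes.
text_nodes = [c for c in all_children if isinstance(c, NavigableString)]

# Keeping only Tag instances leaves the <tr> elements.
rows = [c for c in all_children if isinstance(c, Tag)]

# The class attribute is a list, so test membership rather than equality.
planet_names = [r["name"] for r in rows if "planet" in r.get("class", [])]

print(len(all_children), len(text_nodes), len(rows))  # 7 4 3
print(planet_names)  # ['Mercury', 'Venus']
```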

Last but not least, the parent of a node can be found using the .parent property:

In [7]: str(soup.html.body.div.table.tr.parent)[:200]
Out[7]: '<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n Name\r\n </th>\n<th>\r\n Mass (10^24kg)\r\n </th>\n<th>\r\n '
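
As a short sketch of upward navigation, the following checks .parent and also walks the whole chain of ancestors with .parents. The inline document is a hypothetical stand-in for the planets page, and html.parser is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical document with the same nesting as the planets page.
html = (
    '<html><body><div id="planets"><table id="planetsTable">'
    '<tr id="planetHeader"><th>Name</th></tr>'
    "</table></div></body></html>"
)

soup = BeautifulSoup(html, "html.parser")
row = soup.table.tr

# .parent is the single node directly above the row.
print(row.parent["id"])  # planetsTable

# .parents walks every ancestor up to the BeautifulSoup object itself,
# whose name is the special value '[document]'.
ancestor_names = [p.name for p in row.parents]
print(ancestor_names)  # ['table', 'div', 'body', 'html', '[document]']
```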