- Python Web Scraping Cookbook
- Michael Heydt
- 2021-06-30 18:44:04
How to do it...
We will look at using urlopen and requests to handle HTML in UTF-8. The two libraries handle decoding differently, so let's compare them. Let's start by importing urlopen from urllib.request, loading the page, and examining some of the content.
In [8]: from urllib.request import urlopen
...: page = urlopen("http://localhost:8080/unicode.html")
...: content = page.read()
...: content[840:1280]
...:
Out[8]: b'><strong>Cyrillic</strong> U+0400 \xe2\x80\x93 U+04FF (1024\xe2\x80\x931279)</p>\n <table class="unicode">\n <tbody>\n <tr valign="top">\n <td width="50"> </td>\n <td class="b" width="50">\xd0\x89</td>\n <td class="b" width="50">\xd0\xa9</td>\n <td class="b" width="50">\xd1\x89</td>\n <td class="b" width="50">\xd3\x83</td>\n </tr>\n </tbody>\n </table>\n\n '
Note how the Cyrillic characters were read in as raw multi-byte sequences, shown in \x escape notation, such as \xd0\x89. This is because read() returns bytes rather than text.
To rectify this, we can decode the UTF-8 bytes into a Python string by passing the encoding to the str constructor:
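To make those escape sequences concrete, here is a minimal sketch that decodes the four two-byte sequences from the table cells in the output above and lists the code point each one stands for. The variable names are illustrative only.

```python
# The four two-byte sequences from the table cells in the urlopen output.
raw = b"\xd0\x89\xd0\xa9\xd1\x89\xd3\x83"

# Decode the UTF-8 bytes into text; str(raw, "utf-8") is equivalent.
decoded = raw.decode("utf-8")

# Each pair of bytes collapses into a single Cyrillic code point.
code_points = [f"U+{ord(ch):04X}" for ch in decoded]

print(len(raw), len(decoded))  # 8 bytes become 4 characters
print(decoded, code_points)
```

Because each of these characters takes two bytes but counts as one character after decoding, offsets into the bytes object and offsets into the decoded string drift apart, which is likely why the slice indices in this recipe differ between the two outputs.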
In [9]: str(content, "utf-8")[837:1270]
Out[9]: '<strong>Cyrillic</strong> U+0400 – U+04FF (1024–1279)</p>\n <table class="unicode">\n <tbody>\n <tr valign="top">\n <td width="50"> </td>\n <td class="b" width="50">Љ</td>\n <td class="b" width="50">Щ</td>\n <td class="b" width="50">щ</td>\n <td class="b" width="50">Ӄ</td>\n </tr>\n </tbody>\n </table>\n\n '
Note that the output now shows the characters decoded properly.
We can skip this extra step by using requests, whose text property returns the response body already decoded to a string.
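The example above hardcodes "utf-8". As a hedged sketch, the charset could instead be taken from the Content-Type header when the server declares one. The helper names decode_body and fetch_text are hypothetical and not part of the recipe, and the fallback to UTF-8 is an assumption that fits this page.

```python
from urllib.request import urlopen


def decode_body(raw, charset=None, default="utf-8"):
    """Decode raw bytes with the declared charset, else fall back to default.

    Undecodable bytes are replaced with U+FFFD rather than raising an error.
    """
    return raw.decode(charset or default, errors="replace")


def fetch_text(url):
    """Fetch url and return its body as text (hypothetical helper)."""
    with urlopen(url) as page:
        # get_content_charset() returns None when no charset is declared.
        charset = page.headers.get_content_charset()
        return decode_body(page.read(), charset)


# Usage, assuming the local test server from this recipe is running:
#   fetch_text("http://localhost:8080/unicode.html")[837:1270]
```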
In [10]: import requests
...: response = requests.get("http://localhost:8080/unicode.html")
...: response.text[837:1270]
...:
Out[10]: '<strong>Cyrillic</strong> U+0400 – U+04FF (1024–1279)</p>\n <table class="unicode">\n <tbody>\n <tr valign="top">\n <td width="50"> </td>\n <td class="b" width="50">Љ</td>\n <td class="b" width="50">Щ</td>\n <td class="b" width="50">щ</td>\n <td class="b" width="50">Ӄ</td>\n </tr>\n </tbody>\n </table>\n\n '
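One caveat worth sketching: requests chooses the encoding for text from the response headers, and for text content types that declare no charset it falls back to ISO-8859-1, which would garble a UTF-8 page. The helper below is a hypothetical workaround, not part of the recipe, and it assumes the page really is UTF-8 whenever the header says nothing.

```python
def text_assuming_utf8(response):
    """Return response.text, forcing UTF-8 when no charset was declared.

    `response` is expected to behave like a requests.Response: it exposes
    `headers`, a writable `encoding` attribute, and a `text` property.
    """
    content_type = response.headers.get("Content-Type", "")
    if "charset" not in content_type.lower():
        # Assumption: an undeclared charset means the body is UTF-8.
        response.encoding = "utf-8"
    return response.text


# Usage, assuming requests is installed and the local test server is running:
#   import requests
#   html = text_assuming_utf8(requests.get("http://localhost:8080/unicode.html"))
```

If the server does declare a charset, the helper leaves the encoding untouched and simply returns the text as requests decoded it.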