
Converting between bytes and str

To convert between bytes and str we must know the encoding used to represent the string's Unicode code points as bytes. Python supports a wide variety of so-called codecs such as UTF-8, UTF-16, ASCII, Latin-1, Windows-1251, and so on – consult the Python documentation for a current list of codecs.

In Python we can encode a Unicode str into a bytes object, and going the other way we can decode a bytes object into a Unicode str. In either direction it's up to us to specify the encoding. Python won't, and generally speaking can't, do anything to prevent you from erroneously decoding UTF-16 data stored in a bytes object using, say, a CP037 codec for handling strings on legacy IBM mainframes.

If you're lucky the decoding will fail with a UnicodeError at runtime; if you're unlucky you'll wind up with a str full of garbage that will go undetected by your program.
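Both outcomes can be illustrated with a minimal sketch. The sample text and the choice of ASCII and Latin-1 as the "wrong" codecs are assumptions made for illustration; they stand in for the UTF-16/CP037 mix-up described above.

```python
# Hypothetical sample text; any string containing non-ASCII characters would do.
text = "på vei"
data = text.encode("utf-8")

# Lucky case: ASCII cannot represent bytes above 0x7F, so decoding fails loudly.
try:
    data.decode("ascii")
    failed = False
except UnicodeDecodeError:  # a subclass of UnicodeError
    failed = True

# Unlucky case: Latin-1 maps every possible byte to some character,
# so decoding "succeeds" and silently produces the wrong text.
garbage = data.decode("latin-1")

print(failed)   # True
print(garbage)  # pÃ¥ vei
```

Note that the Latin-1 result is a perfectly valid str; nothing in the program signals that it no longer matches the original text.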

Figure 2.2: Encoding and Decoding.

Let's kick off an interactive session looking at strings, with an interesting Unicode string which contains all the characters of the 29-letter Norwegian alphabet – a pangram:

>>> norsk = "Jeg begynte å fortære en sandwich mens jeg kjørte taxi på vei til quiz"

We'll now encode that using the UTF-8 codec into a bytes object using the encode() method of the str object:

>>> data = norsk.encode('utf-8')
>>> data
b'Jeg begynte \xc3\xa5 fort\xc3\xa6re en sandwich mens jeg kj\xc3\xb8rte taxi p\xc3\xa5 vei til quiz'

See how each of the non-ASCII Norwegian letters (å, æ and ø) has been rendered as a pair of bytes, while the ASCII letters remain a single byte each.

We can reverse the process using the decode() method of the bytes object. Again, it is up to us to supply the correct encoding:

>>> norwegian = data.decode('utf-8')

We can check that the encoding/decoding round-trip gives us a result equal to what we started with:

>>> norwegian == norsk
True

Let's try to display it for good measure:

>>> norwegian
'Jeg begynte å fortære en sandwich mens jeg kjørte taxi på vei til quiz'

All this messing about with encodings may seem like unnecessary detail at this juncture – especially if you operate in an anglophone environment – but it's crucial to understand since files and network resources such as HTTP responses are transmitted as byte streams, whereas we prefer to work with the convenience of Unicode strings.
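As a rough illustration of where this matters in practice, the sketch below writes some UTF-8 bytes to a temporary file and reads them back in two ways. The sample text and the use of a temporary file are assumptions for the sake of a self-contained example; a network response body would be handled in the same spirit.

```python
import os
import tempfile

# Hypothetical sample text for the round trip.
text = "kjørte taxi"

# Write the text to a temporary file as UTF-8 encoded bytes.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(text.encode("utf-8"))
    path = f.name

try:
    # Binary mode gives us the raw bytes, which we must decode ourselves.
    with open(path, "rb") as f:
        raw = f.read()
    decoded = raw.decode("utf-8")

    # Text mode decodes on our behalf, provided we name the encoding.
    with open(path, "r", encoding="utf-8") as f:
        from_text_mode = f.read()
finally:
    os.remove(path)

print(type(raw).__name__, type(decoded).__name__)
print(decoded == from_text_mode == text)
```

Passing the encoding explicitly to open() avoids relying on a platform-dependent default.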

String differences between Python 3 and Python 2
The biggest difference between contemporary Python 3 and legacy Python 2 is the handling of strings. In versions of Python up to and including Python 2, the str type was a so-called byte string, where each character was encoded as a single byte. In this sense, the Python 2 str was similar to the Python 3 bytes. However, the interfaces presented by str and bytes are in fact different in significant ways. In particular, their constructors are completely different, and indexing into a bytes object returns an integer rather than a string containing a single code point. To confuse matters further, there is also a bytes type in Python 2.6 and Python 2.7, but this is just a synonym for str and as such has an identical interface. If you're writing text handling code intended to be portable across Python 2 and Python 3 – which is perfectly possible – tread carefully!
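The indexing and constructor differences can be seen in a short sketch, intended to be run under Python 3. The variable names are illustrative only. In these respects the Python 2 str behaved like the Python 3 str shown here rather than like bytes.

```python
text = "abc"
data = b"abc"

# Indexing a str yields a one-character str; indexing bytes yields an int.
first_char = text[0]
first_byte = data[0]

# Slicing, unlike indexing, yields a bytes object of length one.
first_slice = data[0:1]

# The constructors interpret an integer argument very differently:
# str() formats the number as text, bytes() makes that many zero bytes.
from_int_str = str(3)
from_int_bytes = bytes(3)

print(repr(first_char), first_byte, first_slice)  # 'a' 97 b'a'
print(repr(from_int_str), from_int_bytes)         # '3' b'\x00\x00\x00'
```

Code that indexes into byte strings, or constructs them from integers, is therefore a likely place for behaviour to change when it is moved between the two versions.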