Title: The Python Apprentice
Authors: Robert Smallshire, Austin Bingham
Chapter word count: 562
Last updated: 2021-07-02 22:16:58

Converting between bytes and str

To convert between bytes and str, we must know the encoding used to represent the string's Unicode code points as bytes. Python supports a wide variety of so-called codecs, such as UTF-8, UTF-16, ASCII, Latin-1, Windows-1251, and so on – consult the Python documentation for a current list of codecs.

In Python we can encode a Unicode str into a bytes object, and going the other way we can decode a bytes object into a Unicode str. In either direction it's up to us to specify the encoding. Python won't, and generally speaking can't, do anything to prevent you from erroneously decoding UTF-16 data stored in a bytes object using, say, the CP037 codec for handling strings on legacy IBM mainframes.

If you're lucky the decoding will fail with a UnicodeError at runtime; if you're unlucky you'll wind up with a str full of garbage that will go undetected by your program.
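Both outcomes can be reproduced in a few lines. The following is a minimal sketch, not taken from the book: the sample string and the variable names are assumptions chosen for illustration. It decodes UTF-16 data with CP037 (the unlucky case) and UTF-8 data with ASCII (the lucky case).

```python
# Sample text chosen for illustration; any string with non-ASCII letters works.
text = "Norwegian letters: å æ ø"
utf16_data = text.encode("utf-16")

# Correct codec: we get the original string back.
assert utf16_data.decode("utf-16") == text

# Wrong codec, unlucky case: cp037 assigns a character to every byte value,
# so decoding succeeds and silently produces garbage.
garbage = utf16_data.decode("cp037")
print(repr(garbage))

# Wrong codec, lucky case: ASCII rejects any byte above 0x7F, so Python
# raises UnicodeDecodeError, which is a subclass of UnicodeError.
utf8_data = text.encode("utf-8")
decode_error = None
try:
    utf8_data.decode("ascii")
except UnicodeError as exc:
    decode_error = exc
    print(type(exc).__name__, exc)
```

Note that nothing in the garbage string marks it as wrong; only knowledge of the data's true encoding can tell you that.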

Figure 2.2: Encoding and Decoding.

Let's kick off an interactive session looking at strings, with an interesting Unicode string which contains all the characters of the 29-letter Norwegian alphabet – a pangram:

>>> norsk = "Jeg begynte å fortære en sandwich mens jeg kjørte taxi på vei til quiz"

We'll now encode that string into a bytes object with the UTF-8 codec, using the encode() method of the str object:

>>> data = norsk.encode('utf-8')
>>> data
b'Jeg begynte \xc3\xa5 fort\xc3\xa6re en sandwich mens jeg kj\xc3\xb8rte taxi p\xc3\xa5 vei til quiz'

See how each of the non-ASCII Norwegian letters (å, æ and ø) has been rendered as a pair of bytes, while the ASCII characters are unchanged.
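To make the byte pairs concrete, here is a small sketch that compares the length of the string in code points with the length of its UTF-8 encoding. The variable extra_bytes and the choice of sample characters are assumptions for illustration.

```python
norsk = "Jeg begynte å fortære en sandwich mens jeg kjørte taxi på vei til quiz"
data = norsk.encode("utf-8")

# An ASCII character such as J occupies one byte in UTF-8;
# å, æ and ø occupy two bytes each.
for ch in "Jåæø":
    encoded = ch.encode("utf-8")
    print(ch, encoded, len(encoded))

# The pangram holds four non-ASCII letters (å, æ, ø, å), so the bytes
# object is four bytes longer than the string is in code points.
extra_bytes = len(data) - len(norsk)
print(len(norsk), len(data), extra_bytes)
```

This is why len() on a str and len() on its encoded bytes should not be assumed to agree.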

We can reverse the process using the decode() method of the bytes object. Again, it is up to us to supply the correct encoding:

>>> norwegian = data.decode('utf-8')

We can check that the encoding/decoding round-trip gives us a result equal to what we started with:

>>> norwegian == norsk
True

Let's try to display it for good measure:

>>> norwegian
'Jeg begynte å fortære en sandwich mens jeg kjørte taxi på vei til quiz'
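The round trip is not specific to UTF-8. As a rough sketch, the loop below repeats it with a few of the codecs mentioned earlier; the selection of codecs and the ascii_ok flag are assumptions for illustration. It succeeds with any codec that can represent every character in the string, and fails at the encoding step with one that cannot.

```python
norsk = "Jeg begynte å fortære en sandwich mens jeg kjørte taxi på vei til quiz"

# These codecs can all represent å, æ and ø, so the round trip succeeds,
# although the number of bytes produced differs from codec to codec.
for codec in ("utf-8", "utf-16", "latin-1"):
    encoded = norsk.encode(codec)
    assert encoded.decode(codec) == norsk
    print(codec, len(encoded))

# ASCII has no representation for å, æ or ø, so encoding itself fails.
try:
    norsk.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print(exc)
```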

All this messing about with encodings may seem like unnecessary detail at this juncture – especially if you operate in an anglophone environment – but it's crucial to understand since files and network resources such as HTTP responses are transmitted as byte streams, whereas we prefer to work with the convenience of Unicode strings.
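A file makes the point without needing a network connection. The sketch below is illustrative only: the temporary directory and the file name pangram.txt are hypothetical stand-ins for any external byte source. It writes the encoded pangram to disk, then reads it back both as raw bytes and as text.

```python
import os
import tempfile

norsk = "Jeg begynte å fortære en sandwich mens jeg kjørte taxi på vei til quiz"

# Hypothetical temporary file standing in for any external byte source.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pangram.txt")

    # Binary mode deals in bytes, so we encode before writing.
    with open(path, "wb") as f:
        f.write(norsk.encode("utf-8"))

    # Binary mode hands back bytes, which we decode ourselves.
    with open(path, "rb") as f:
        raw = f.read()
    from_bytes = raw.decode("utf-8")

    # Text mode decodes for us, provided we name the correct encoding.
    with open(path, "r", encoding="utf-8") as f:
        from_text = f.read()

print(type(raw).__name__, type(from_text).__name__)
```

Passing encoding explicitly to open() avoids depending on the platform's default encoding, which may not match the file's contents.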

String differences between Python 3 and Python 2

The biggest difference between contemporary Python 3 and legacy Python 2 is the handling of strings. In versions of Python up to and including Python 2, the str type was a so-called byte string, where each character was encoded as a single byte. In this sense, Python 2 str was similar to Python 3 bytes. However, the interfaces presented by str and bytes are in fact different in significant ways. In particular, their constructors are completely different, and indexing into a bytes object returns an integer rather than a string containing a single code point. To confuse matters further, there is also a bytes type in Python 2.6 and Python 2.7, but this is just a synonym for str and as such has an identical interface. If you're writing text handling code intended to be portable across Python 2 and Python 3 – which is perfectly possible – tread carefully!
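The interface differences on the Python 3 side can be seen in a short sketch. It assumes a Python 3 interpreter, and the variable names are chosen for illustration; under Python 2, where bytes is a synonym for str, the same expressions behave differently.

```python
# Python 3 only.
text = "abc"
data = b"abc"

# Indexing a str yields a one-character str; indexing bytes yields an int.
first_char = text[0]     # 'a'
first_byte = data[0]     # 97
# Slicing bytes, by contrast, yields bytes.
first_slice = data[0:1]  # b'a'

# The constructors treat the same argument very differently.
three_as_text = str(3)       # the string '3'
three_zero_bytes = bytes(3)  # three zero bytes: b'\x00\x00\x00'

# Constructing bytes from a str requires an explicit encoding.
from_text = bytes("abc", "ascii")
try:
    bytes("abc")
    needs_encoding = False
except TypeError:
    needs_encoding = True

print(first_char, first_byte, first_slice, three_as_text, three_zero_bytes)
```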
