Encoding and Decoding Strings (in Python 3.x)

In our other article, Encoding and Decoding Strings (in Python 2.x), we looked at how Python 2.x works with string encoding. Here we will look at encoding and decoding strings in Python 3.x, and how it is different.

Encoding/decoding strings in Python 3.x vs Python 2.x

Many things in Python 2.x changed little when the language branched off into the current Python 3.x versions. The Python string is not one of them; in fact, it is probably what changed most drastically. The changes are most evident in how strings are encoded and decoded in Python 3.x as opposed to Python 2.x. Encoding and decoding strings in Python 2.x was somewhat of a chore, as you might have read in our other article. Thankfully, converting 8-bit strings into unicode strings and vice versa, along with all the steps in between, is largely a thing of the past in Python 3.x. Let’s examine what this means by going straight to some examples.

We’ll start with an example string containing a non-ASCII character (i.e., “ü” or “umlaut-u”):

s = 'Flügel'

Now if we reference and print the string, it gives us essentially the same result:

>>> s
'Flügel'
>>> print(s)
Flügel

In contrast to the same string s in Python 2.x, s here is already a Unicode string, because all strings in Python 3.x are Unicode. The visible difference is that the interpreter echoes s back exactly as we typed it, rather than as a sequence of escaped bytes.

Although our string contains a non-ASCII character, it isn’t very far from the ASCII character set, also known as Basic Latin (in fact, “ü” belongs to the Latin-1 Supplement block that follows Basic Latin). What would happen if we had a character that is not only non-ASCII but also non-Latin? Let’s try it:

>>> nonlat = '字'
>>> nonlat
'字'
>>> print(nonlat)
字

As we can see, it doesn’t matter whether it’s a string containing all Latin characters or otherwise, because strings in Python 3.x will all behave this way (and unlike in Python 2.x you can type any character into the IDLE window!).
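As a quick check of the claim that every Python 3.x string is a Unicode string, the minimal sketch below inspects the two example strings. The variable names other than s and nonlat are our own, chosen for illustration; the code points in the comments are the standard Unicode values for these characters.

```python
# Both example strings are ordinary str objects in Python 3.x.
s = 'Flügel'
nonlat = '字'

s_type = type(s)
nonlat_type = type(nonlat)

# len() counts Unicode code points, not bytes.
s_length = len(s)            # 6
nonlat_length = len(nonlat)  # 1

# ord() returns the code point of a single character.
u_umlaut_code = ord(s[2])    # 252, i.e. U+00FC
nonlat_code = ord(nonlat)    # 23383, i.e. U+5B57

print(s_type, s_length, hex(u_umlaut_code), hex(nonlat_code))
```

Note that the lengths are counted in characters: how many bytes each string occupies only becomes a question once we encode it.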

If you have dealt with encoding and decoding strings in Python 2.x, you know how troublesome they can be, and that Python 3.x makes it much less painful. However, if we don’t need the unicode function, the encode or decode methods, or multiple backslash escapes in our string variables just to use them, then what need do we have to encode or decode our Python 3.x strings? Before answering that question, we’ll first look at b'...' (bytes) objects in Python 3.x in contrast to the same in Python 2.x.

The Python 3.x bytes object

In Python 2.x, prefixing a string literal with a “b” (or “B”) is legal syntax, but it does nothing special:

>>> b'prefix in Python 2.x'
'prefix in Python 2.x'

In Python 3.x, however, this prefix indicates that the literal is a bytes object, which differs from the normal string (which, as we know, is a Unicode string by default), and the ‘b’ prefix is even preserved in the output:

>>> b'prefix in Python 3.x'
b'prefix in Python 3.x'

The thing about bytes objects is that they are actually sequences of integers (each in the range 0 to 255), though Python displays them using ASCII characters where it can. How or why they are sequences of integers is not of great importance to us at this point; what is important is that a bytes literal can only contain ASCII literal characters, and any other byte value must be written as an escape such as \xe5. This is why the following won’t work (nor will it with any other non-ASCII character):

>>> b'字'
SyntaxError: bytes can only contain ASCII literal characters.
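To make the “sequence of integers” point concrete, here is a minimal sketch. The variable names are illustrative only; the behavior shown (indexing returns an int, and bytes accepts an iterable of integers from 0 to 255) is standard for Python 3.x bytes objects.

```python
data = b'abc'

# Indexing a bytes object yields an integer, not a one-character string.
first = data[0]              # 97, the ASCII code for 'a'
as_ints = list(data)         # [97, 98, 99]

# Byte values outside the ASCII range are written with \x escapes...
escaped = b'\xe5\xad\x97'
# ...or built directly from integers in the range 0-255.
from_ints = bytes([0xE5, 0xAD, 0x97])

print(first, as_ints, escaped == from_ints)
```

So the restriction in the error above applies to what we may type inside the literal, not to the values a bytes object can hold.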

Now to see how bytes objects relate to strings, let’s first look at how to turn a string into a bytes object and vice versa.

Converting Python strings to bytes, and bytes to strings

If we want to turn our nonlat string from before into a bytes object, we can use the bytes constructor; however, if we pass the string as the sole argument, we’ll get this error:

>>> bytes(nonlat)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding

As we can see, we need to include an encoding with the string. Let’s use a common one, the UTF-8 encoding:

>>> bytes(nonlat, 'utf-8')
b'\xe5\xad\x97'

Now we have our bytes object, encoded in UTF-8 … but what exactly does that mean? It means that the single character contained in our nonlat variable was translated into the sequence of bytes that represents “字” in UTF-8; in other words, it was encoded. Does this mean that if we call the encode method on nonlat, we’ll get the same result? Let’s see:

>>> nonlat.encode()
b'\xe5\xad\x97'

Indeed we got the same result, but we did not have to give the encoding in this case because the encode method in Python 3.x uses the UTF-8 encoding by default. If we changed it to UTF-16, we’d have a different result:

>>> nonlat.encode('utf-16')
b'\xff\xfeW['

Both calls perform the same operation, but the bytes they produce depend on the encoding, or codec, used. (Here the leading \xff\xfe is the byte order mark that the UTF-16 codec prepends.)
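The sketch below puts several codecs side by side for the same character. It assumes the standard codec names 'utf-16-le' and 'utf-16-be', which fix the byte order and add no byte order mark; the plain 'utf-16' codec instead prepends the mark and uses the machine’s native byte order, so its output can vary between platforms. The variable names are our own.

```python
nonlat = '字'

utf8_bytes = nonlat.encode('utf-8')          # b'\xe5\xad\x97'
utf16_le_bytes = nonlat.encode('utf-16-le')  # b'W[' (bytes 0x57, 0x5b)
utf16_be_bytes = nonlat.encode('utf-16-be')  # b'[W' (bytes 0x5b, 0x57)

# bytes(string, encoding) and str.encode(encoding) give the same result.
same = bytes(nonlat, 'utf-8') == utf8_bytes

# str(bytes_object, encoding) mirrors bytes.decode(encoding).
round_trip = str(utf8_bytes, 'utf-8')

print(utf8_bytes, utf16_le_bytes, utf16_be_bytes, same)
```

One character, three different byte sequences: this is why the codec always has to travel with the bytes.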

Since we can encode strings to make bytes, we can also decode bytes to make strings—but when decoding a bytes object, we must know the correct codec to use to get the correct result. For example, if we try to use UTF-8 to decode a UTF-16-encoded version of nonlat above:

# We can use the method directly on the bytes
>>> b'\xff\xfeW['.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

And we get an error! Now if we use the correct codec it turns out fine:

>>> b'\xff\xfeW['.decode('utf-16')
'字'

In this case Python alerted us because the decoding operation failed, but the caveat is that errors will not always occur when the codec is incorrect! This is because different codecs often use the same byte values (the “\xXX” escapes that appear in bytes objects) to represent different things. If we think of this in the context of human languages, using different codecs to encode and decode the same information would be like trying to translate a word from Spanish into English with an Italian-English dictionary: some of the phonemes in Italian and Spanish might be similar, but you’ll still be left with the wrong translation!
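Here is a minimal sketch of such a silent failure, using the Latin-1 codec as the “wrong dictionary.” Latin-1 is chosen only for illustration: it maps every possible byte value to some character, so decoding with it never raises an error. The variable names are our own.

```python
s = 'Flügel'

encoded = s.encode('utf-8')         # b'Fl\xc3\xbcgel': 'ü' became two bytes

# Decoding with the wrong codec succeeds, but each of the two bytes
# is turned into its own character.
wrong = encoded.decode('latin-1')   # 'FlÃ¼gel'
right = encoded.decode('utf-8')     # 'Flügel'

print(len(s), len(wrong), wrong == s, right == s)
```

No exception is raised for wrong, yet it is one character longer than the original and no longer spells the same word.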

Writing non-ASCII Data to Files in Python 3.x

As a final note on strings in Python 3.x and Python 2.x, we must remember that a file ultimately stores bytes, so a Unicode string containing non-ASCII characters has to be encoded before it can be written. In Python 2.x, the built-in open function will not do this for us. In Python 3.x, open in text mode does encode the string, but with a platform-dependent default encoding unless we pass one explicitly, so the result may be an error or unexpected bytes on some systems.

This is no big deal in Python 2.x, as a string will only be Unicode if you make it so (by using the unicode function or str.decode), but in Python 3.x all strings are Unicode by default. If we want to write such a string, e.g. nonlat, to a file with full control over the encoding, we can use str.encode and the wb (binary) mode for open, like so:

>>> with open('nonlat.txt', 'wb') as f:
    f.write(nonlat.encode())

Also when reading from a file with non-ASCII data, it’s important to use the rb mode and decode the data with the correct codec — unless of course you don’t mind having an “Italian” translation for your “Spanish.”
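The following sketch shows a full round trip. It writes to a temporary directory (our own choice, so that no real files are touched) and, alongside the binary-mode approach, also shows text mode with the encoding parameter of open in Python 3.x, which performs the same conversion for us. Treat it as one possible arrangement rather than the only correct one.

```python
import os
import tempfile

nonlat = '字'

# A temporary directory keeps the sketch from touching real files.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'nonlat.txt')

    # Binary mode: encode by hand on the way in, decode on the way out.
    with open(path, 'wb') as f:
        f.write(nonlat.encode('utf-8'))
    with open(path, 'rb') as f:
        raw = f.read()
    from_binary = raw.decode('utf-8')

    # Text mode with an explicit encoding: open() does the conversion.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(nonlat)
    with open(path, 'r', encoding='utf-8') as f:
        from_text = f.read()

print(raw, from_binary == nonlat, from_text == nonlat)
```

Either way, the codec used for reading must match the one used for writing.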
