The Python Oracle

Convert bytes to a string

--------------------------------------------------
Hire the world's top talent on demand or became one of them at Toptal: https://topt.al/25cXVn
and get $2,000 discount on your first invoice
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Techno Bleepage Open

--

Chapters
00:00 Convert Bytes To A String
00:39 Accepted Answer Score 5765
00:59 Answer 2 Score 424
01:23 Answer 3 Score 264
01:33 Answer 4 Score 131
03:29 Thank you

--

Full question
https://stackoverflow.com/questions/6061...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #string #python3x

#avk47



ACCEPTED ANSWER

Score 5765


Decode the bytes object to produce a string:

>>> b"abcde".decode("utf-8")
'abcde'

The above example assumes that the bytes object is in UTF-8, because it is a common encoding. However, you should use the encoding your data is actually in!




ANSWER 2

Score 424


Decode the byte string and turn it in to a character (Unicode) string.


Python 3:

encoding = 'utf-8'
b'hello'.decode(encoding)

or

str(b'hello', encoding)

Python 2:

encoding = 'utf-8'
'hello'.decode(encoding)

or

unicode('hello', encoding)



ANSWER 3

Score 264


This joins together a list of bytes into a string:

>>> bytes_data = [112, 52, 52]
>>> "".join(map(chr, bytes_data))
'p44'



ANSWER 4

Score 131


If you don't know the encoding, then to read binary input into string in Python 3 and Python 2 compatible way, use the ancient MS-DOS CP437 encoding:

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

Because encoding is unknown, expect non-English symbols to translate to characters of cp437 (English characters are not translated, because they match in most single byte encodings and UTF-8).

Decoding arbitrary binary input to UTF-8 is unsafe, because you may get this:

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid
start byte

The same applies to latin-1, which was popular (the default?) for Python 2. See the missing points in Codepage Layout - it is where Python chokes with infamous ordinal not in range.

UPDATE 20150604: There are rumors that Python 3 has the surrogateescape error strategy for encoding stuff into binary data without data loss and crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.

UPDATE 20170116: Thanks to comment by Nearoo - there is also a possibility to slash escape all unknown bytes with backslashreplace error handler. That works only for Python 3, so even with this workaround you will still get inconsistent output from different Python versions:

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

See Python’s Unicode Support for details.

UPDATE 20170119: I decided to implement slash escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.

# --- preparation

import codecs

def slashescape(err):
    """ codecs error handler. err is UnicodeDecode instance. return
    a tuple with a replacement for the unencodable part of the input
    and a position where encoding should continue"""
    #print err, dir(err), err.start, err.end, err.object[:err.start]
    thebyte = err.object[err.start:err.end]
    repl = u'\\x'+hex(ord(thebyte))[2:]
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))