UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c

--

Chapters
00:00 Question
01:10 Accepted answer (Score 424)
01:53 Answer 2 (Score 133)
02:23 Answer 3 (Score 77)
03:02 Answer 4 (Score 39)
03:28 Thank you

--

Full question
https://stackoverflow.com/questions/1246...

Accepted answer links:
http://docs.python.org/howto/unicode.htm...
[codecs]: https://docs.python.org/2/library/codecs...

Answer 3 links:
http://python-notes.curiousefficiency.or...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #linux #pythonunicode

ACCEPTED ANSWER

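In Python 2, decode the byte string with an error handler so that undecodable bytes do not raise an exception:

str = unicode(str, errors='replace')

or

str = unicode(str, errors='ignore')

Note: errors='ignore' strips out the characters in question and returns the string without them; errors='replace' substitutes U+FFFD (the replacement character) for them instead.

For me this is the ideal case, since I'm using it as protection against non-ASCII input, which my application does not allow.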
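
Python 3 has no unicode built-in; the same error handlers can be passed to bytes.decode instead. A minimal sketch, assuming the raw input is available as a bytes object (the sample value raw is made up for illustration):

```python
# Hypothetical sample: ASCII text with one stray byte (0x9c) that is not valid UTF-8.
raw = b"abc\x9cdef"

# errors="replace" substitutes U+FFFD (the replacement character) for the bad byte.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" silently drops the bad byte.
ignored = raw.decode("utf-8", errors="ignore")

print(ascii(replaced))
print(ascii(ignored))
```

Both handlers lose information, so they suit input filtering better than data that has to round-trip.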

Alternatively, use the open function from the codecs module to read in the file:

import codecs
with codecs.open(file_name, 'r', encoding='utf-8',
                 errors='ignore') as fdata:
    text = fdata.read()  # undecodable bytes are dropped
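
In Python 3 the built-in open accepts the same encoding and errors arguments, so codecs.open is not strictly needed there. A minimal sketch, assuming a throwaway sample file (its name and contents are invented here purely so the example runs on its own):

```python
import os
import tempfile

# Write a hypothetical sample file containing one byte (0x9c) that is invalid UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"caf\x9c latte\n")
    file_name = tmp.name

try:
    # Built-in open() takes encoding= and errors= just like codecs.open().
    with open(file_name, "r", encoding="utf-8", errors="ignore") as fdata:
        text = fdata.read()
finally:
    os.remove(file_name)  # clean up the sample file

print(repr(text))
```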

ANSWER 2

Changing the pandas read_csv parser engine from C to Python did the trick for me.

With the C engine:

pd.read_csv(gdp_path, sep='\t', engine='c')

the call fails with:

'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

With the Python engine:

pd.read_csv(gdp_path, sep='\t', engine='python')

No errors for me.
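
The error message names the offending byte and its offset, which can help in guessing the file's real encoding. A minimal sketch using only the standard library; the sample data is hypothetical, and cp1252 is only an assumption about where a 0x92 byte typically comes from, not something the answer confirms:

```python
# Hypothetical tab-separated data with a stray 0x92 byte at offset 18.
data = b"Country\tGDP\nFrance\x92s\t1\n"

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    # The exception records where decoding stopped and why.
    bad_offset = exc.start
    bad_byte = data[exc.start:exc.end]
    reason = exc.reason
    print(bad_offset, bad_byte, reason)

# Assumption: the file was written as Windows-1252, where 0x92 is a curly apostrophe.
decoded = data.decode("cp1252")
print(ascii(decoded))
```

If that guess holds for the real file, passing encoding='cp1252' to pd.read_csv may be a more direct fix than switching engines.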

ANSWER 3

This type of issue crops up for me now that I've moved to Python 3. I had no idea Python 2 was simply steamrolling over any issues with file encoding.

I found a nice explanation of the differences and how to find a solution (see "Answer 3 links" above) after none of the above worked for me.

In short, to make Python 3 behave as similarly as possible to Python 2, use:

with open(filename, encoding="latin-1") as datafile:
    text = datafile.read()  # work on the decoded text here

However, read the article; there is no one-size-fits-all solution.
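
The reason latin-1 never raises is that it maps each of the 256 possible byte values to the code point with the same number. A minimal sketch of that property and of its cost (variable names are illustrative only):

```python
# Every possible byte value decodes under latin-1, and the mapping is reversible.
every_byte = bytes(range(256))
text = every_byte.decode("latin-1")
round_trip = text.encode("latin-1")

# The cost: a byte such as 0x9c becomes the control character U+009C,
# which may not be the character the file's author intended.
as_latin1 = b"\x9c".decode("latin-1")

print(len(text), round_trip == every_byte, ascii(as_latin1))
```

So latin-1 preserves the raw bytes losslessly, but the resulting text is only correct if the file really was written in latin-1.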

ANSWER 4

In Python 2, decoding the byte 0x9c as cp1252 (Windows-1252) gives œ (U+0153):

>>> '\x9c'.decode('cp1252')
u'\u0153'
>>> print '\x9c'.decode('cp1252')
œ
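
The session above is Python 2. In Python 3, str objects have no decode method, so the same check starts from a bytes literal. A minimal sketch:

```python
import unicodedata

# Python 3: decode a bytes object; the result is already a (Unicode) str.
decoded = b"\x9c".decode("cp1252")

# ascii() avoids depending on the console's own encoding when printing.
print(ascii(decoded))
print(unicodedata.name(decoded))
```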