UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c
--
Full question
https://stackoverflow.com/questions/1246...
Accepted answer links:
http://docs.python.org/howto/unicode.htm...
[codecs]: https://docs.python.org/2/library/codecs...
Answer 3 links:
http://python-notes.curiousefficiency.or...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
ACCEPTED ANSWER
Score 448
http://docs.python.org/howto/unicode.html#the-unicode-type
str = unicode(str, errors='replace')
or
str = unicode(str, errors='ignore')
Note: errors='ignore' strips out the characters in question and returns the string without them, while errors='replace' substitutes the Unicode replacement character (U+FFFD) for each of them.
For me this is the ideal case, since I'm using it as protection against non-ASCII input, which is not allowed by my application.
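In Python 3 the unicode() builtin is gone, and the same error handlers are passed to bytes.decode() instead. A minimal sketch, assuming the input arrives as bytes; the sample value raw is made up for illustration:

```python
# Hypothetical sample: ASCII text with one stray byte (0x9c) that is not valid UTF-8.
raw = b"caf\x9c au lait"

# errors="replace" substitutes U+FFFD (the replacement character) for the bad byte.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" silently drops the bad byte.
ignored = raw.decode("utf-8", errors="ignore")

print(ascii(replaced))  # 'caf\ufffd au lait'
print(ascii(ignored))   # 'caf au lait'
```

Both handlers lose information, so they suit input sanitising better than faithful round-tripping.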
Alternatively, use the open function from the codecs module to read in the file:
import codecs
with codecs.open(file_name, 'r', encoding='utf-8',
                 errors='ignore') as fdata:
    text = fdata.read()  # undecodable bytes are dropped
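On Python 3, the built-in open() accepts the same encoding and errors arguments, so the codecs module is not strictly needed. A rough sketch under that assumption; the temporary file and its contents are invented for the example:

```python
import os
import tempfile

# Write a small file containing one byte (0x9c) that is invalid as UTF-8.
# The file name and contents are made up for illustration.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"price: 10\x9c\n")
    file_name = tmp.name

try:
    # Built-in open() takes encoding= and errors= just like codecs.open().
    with open(file_name, "r", encoding="utf-8", errors="ignore") as fdata:
        text = fdata.read()
finally:
    os.remove(file_name)

print(repr(text))  # 'price: 10\n'
```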
ANSWER 2
Score 136
Changing the engine from C to Python did the trick for me.
Engine is C:
pd.read_csv(gdp_path, sep='\t', engine='c')
'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
Engine is Python:
pd.read_csv(gdp_path, sep='\t', engine='python')
No errors for me.
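The error message names the offending byte and its position, which often hints at the real encoding of the file. A minimal standard-library sketch (not pandas-specific) that reads those details from the exception; the sample data is hypothetical. Byte 0x92 is the right single quotation mark in Windows-1252, so if the file really is in that encoding, passing encoding='cp1252' to read_csv may be an alternative to switching engines.

```python
# Hypothetical tab-separated data containing byte 0x92, which is the
# right single quotation mark in Windows-1252 but invalid as UTF-8 here.
raw = b"name\tnote\nUK\tit\x92s fine\n"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # The exception carries the input bytes and the failing slice.
    bad_byte = exc.object[exc.start:exc.end]
    position = exc.start
    reason = exc.reason
    print(f"byte {bad_byte!r} at position {position}: {reason}")

# Decoding with the encoding the data was actually written in succeeds.
text = raw.decode("cp1252")
```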
ANSWER 3
Score 83
This type of issue crops up for me now that I've moved to Python 3. I had no idea Python 2 was simply steamrolling over any issues with file encoding.
I found this nice explanation of the differences and how to find a solution after none of the above worked for me.
http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
In short, to make Python 3 behave as similarly as possible to Python 2, use:
with open(filename, encoding="latin-1") as datafile:
    # work on datafile here
    data = datafile.read()
However, read the article; there is no one-size-fits-all solution.
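The reason latin-1 never raises is that it maps every byte value to a character. The trade-off is that text in another encoding decodes without error into the wrong characters. A small sketch of both effects, using a made-up sample string:

```python
# latin-1 assigns a character to every byte value, so decoding cannot fail.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
assert decoded.encode("latin-1") == all_bytes  # lossless round trip

# The catch: UTF-8 data read as latin-1 decodes "successfully" into mojibake.
utf8_bytes = "café".encode("utf-8")
misread = utf8_bytes.decode("latin-1")
print(misread)  # cafÃ©
```

If the data is later written back with the same latin-1 encoding, the original bytes are preserved, which is why this approach is closest to Python 2's byte-oriented behavior.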
ANSWER 4
Score 38
>>> '\x9c'.decode('cp1252')
u'\u0153'
>>> print '\x9c'.decode('cp1252')
œ
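The session above is Python 2, where a plain string literal is a byte string. A Python 3 sketch of the same check, assuming the data is Windows-1252; it also shows why the UTF-8 codec rejects this byte:

```python
raw = b"\x9c"

# In Windows-1252, byte 0x9c maps to U+0153 (LATIN SMALL LIGATURE OE).
char = raw.decode("cp1252")
print(ascii(char))  # '\u0153'

# On its own, the same byte is not valid UTF-8, which is what triggers the error.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(utf8_ok)  # False
```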