The Python Oracle

python: unicode problem

Full question
https://stackoverflow.com/questions/4735...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #unicode



ACCEPTED ANSWER

Score 20


This looks like UTF-16 data, so try:

data[0].rstrip("\n").decode("utf-16")

Edit (for your update): Try decoding the whole file at once, that is:

data = open(...).read()
data.decode("utf-16")

The problem is that a line break in UTF-16-LE is encoded as the two bytes "\n\x00". readlines() splits right after the "\n" byte, so the "\x00" byte is left at the start of the next line and each line on its own is no longer valid UTF-16.
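
To make the splitting problem concrete, here is a minimal sketch. It is written for Python 3 (an assumption; the snippets above are Python 2), and the sample text and variable names are made up for illustration. It builds UTF-16-LE bytes with a BOM, splits them on the single byte b"\n", and compares that with decoding everything first.

```python
import codecs

# Hypothetical sample data: two lines, encoded as UTF-16-LE with a BOM.
text = "first\nsecond\n"
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Splitting on the one-byte b"\n" cuts each two-byte line break in half.
chunks = raw.split(b"\n")

# The second chunk begins with the orphaned b"\x00" from the line break,
# so every following byte pair is shifted by one.
starts_with_nul = chunks[1].startswith(b"\x00")

# Decoding the whole byte string first, then splitting the text, is safe.
decoded_lines = raw.decode("utf-16").splitlines()
```

Decoding first and splitting afterwards keeps every code unit intact, which is why reading the whole file at once works.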




ANSWER 2

Score 11


This file is a UTF-16-LE encoded file, with an initial BOM.

import codecs

fp = codecs.open("a", "r", "utf-16")
lines = fp.readlines()
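
For readers on Python 3 (an assumption; the question is about 2.7), the built-in open() accepts an encoding argument, so a similar result should be possible without codecs.open. The sketch below is only an illustration: read_utf16_lines is a hypothetical helper, and the temporary file stands in for the real one.

```python
import codecs
import os
import tempfile


def read_utf16_lines(path):
    """Return the lines of a UTF-16 text file without their line endings."""
    # The "utf-16" codec reads the BOM to pick the byte order and drops it.
    with open(path, "r", encoding="utf-16") as fp:
        return fp.read().splitlines()


# Build a small UTF-16-LE file with a BOM (hypothetical sample data).
fd, sample_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as out:
    out.write(codecs.BOM_UTF16_LE + "a,b\nc,d\n".encode("utf-16-le"))

lines = read_utf16_lines(sample_path)
os.remove(sample_path)
```

Because the file object decodes before it looks for line breaks, the two-byte newline is never split.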



ANSWER 3

Score 3


EDIT

Since you mentioned Python 2.7, this is the 2.7 solution:

file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "replace") for line in file]

Ignoring undecodable characters:

file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "ignore") for line in file]
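
The "replace" and "ignore" handlers can also be applied to the whole content at once, which avoids the line-splitting issue described in the accepted answer. A rough Python 3 sketch (an assumption; the code above is 2.7) with made-up sample bytes that end in a stray, incomplete code unit:

```python
import codecs

# Hypothetical damaged input: valid UTF-16-LE text followed by one stray byte.
raw = codecs.BOM_UTF16_LE + "ok".encode("utf-16-le") + b"\x41"

# "replace" substitutes U+FFFD for data that cannot be decoded.
replaced = raw.decode("utf-16", "replace")

# "ignore" silently drops data that cannot be decoded.
ignored = raw.decode("utf-16", "ignore")

# The default handler ("strict") raises instead of guessing.
try:
    raw.decode("utf-16")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Both handlers hide damage rather than repair it, so it is worth checking that the encoding is right before reaching for them.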