The Python Oracle

python: unicode problem

Full question
https://stackoverflow.com/questions/4735...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #unicode




ACCEPTED ANSWER

Score 20


This looks like UTF-16 data, so try:

data[0].rstrip("\n").decode("utf-16")

Edit (for your update): Try decoding the whole file at once:

data = open(...).read()
text = data.decode("utf-16")

The problem is that a line break in UTF-16-LE is the two bytes "\n\x00", but readlines() splits right after the "\n" byte. The "\x00" byte then ends up at the start of the next line, so every line after the first is misaligned for decoding.
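
To see the effect, here is a minimal sketch using a made-up two-line string (not the asker's file). It builds UTF-16-LE bytes with a BOM, splits them the way readlines() would split raw bytes, and then decodes the whole buffer instead. The sample text and variable names are assumptions for illustration; the code is written to run on both Python 2.7 and Python 3.

```python
import codecs
import io

# Hypothetical sample text standing in for the file contents.
text = u"first\nsecond\n"

# UTF-16-LE bytes with a leading BOM, similar to what the answers describe.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Splitting raw bytes at "\n" cuts each 2-byte newline in half:
# the trailing "\x00" byte lands at the start of the following line.
byte_lines = io.BytesIO(raw).readlines()
second_starts_with_nul = byte_lines[1].startswith(b"\x00")

# Decoding the whole buffer first keeps every 2-byte unit intact,
# and the "utf-16" codec consumes the BOM.
decoded_lines = raw.decode("utf-16").splitlines()
```

After the byte-level split, the second chunk starts with a stray "\x00", which is why decoding line by line goes wrong, while decoded_lines holds the two clean lines.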




ANSWER 2

Score 11


This file is UTF-16-LE encoded, with an initial BOM. The "utf-16" codec reads the BOM to pick the byte order, so the file can be opened as text directly:

import codecs

fp = codecs.open("a", "r", "utf-16")
lines = fp.readlines()
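
A self-contained sketch of the same idea follows. It swaps codecs.open for io.open (available in 2.7 and the same as the built-in open in Python 3), which is a substitution and not what the answer itself uses. The helper name read_utf16_lines, the temporary file, and its contents are all hypothetical and exist only so the example can run on its own.

```python
import codecs
import io
import os
import tempfile


def read_utf16_lines(path):
    """Open a UTF-16 file as text and return its lines without newlines."""
    # The "utf-16" codec uses the BOM to choose the byte order.
    with io.open(path, "r", encoding="utf-16") as fp:
        return fp.read().splitlines()


# Write a small made-up UTF-16-LE file with a BOM to read back.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv")
with open(sample_path, "wb") as out:
    out.write(codecs.BOM_UTF16_LE)
    out.write(u"Keyword,Volume\nlamp post,100\n".encode("utf-16-le"))

rows = read_utf16_lines(sample_path)
```

Because decoding happens before any line splitting, the 2-byte newline problem never arises.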



ANSWER 3

Score 3


EDIT

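Since you mentioned Python 2.7, here is a 2.7 solution. It decodes the whole file before splitting it into lines, so no 2-byte unit is cut in half, and it replaces undecodable input with U+FFFD:

with open("./Downloads/lamp-post.csv", "rb") as f:
    data = f.read().decode("utf-16", "replace").splitlines()

Ignoring undecodable characters instead:

with open("./Downloads/lamp-post.csv", "rb") as f:
    data = f.read().decode("utf-16", "ignore").splitlines()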
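
The difference between the "replace" and "ignore" error handlers can be seen on a small made-up byte string. This is only a sketch: the input below is hypothetical, with one stray byte appended so that the last 2-byte unit is incomplete. It should run on both Python 2.7 and Python 3.

```python
# Hypothetical input: "ab" in UTF-16-LE followed by one stray byte,
# which leaves an incomplete 2-byte unit at the end.
raw = u"ab".encode("utf-16-le") + b"c"

# The default (strict) handler raises on the truncated unit.
try:
    raw.decode("utf-16-le")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD for the bad input; "ignore" drops it silently.
replaced = raw.decode("utf-16-le", "replace")
ignored = raw.decode("utf-16-le", "ignore")
```

Both handlers hide the error and do not fix it, so they are best treated as a last resort once the data is known to be decoded with the right codec and alignment.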