Read a unicode file in python which declares its encoding in the same way as python source
--
Track title: CC L Beethoven - Piano Sonata No 8 in C
--
Chapters
00:00 Question
02:51 Accepted answer (Score 7)
04:00 Answer 2 (Score 3)
08:58 Answer 3 (Score 2)
10:26 Answer 4 (Score 2)
11:46 Thank you
--
Full question
https://stackoverflow.com/questions/6078...
Question links:
https://www.python.org/dev/peps/pep-0263/
Accepted answer links:
[encodings like UTF-16]: http://en.wikipedia.org/wiki/UTF-16/UCS-...
[the byte order mark]: http://en.wikipedia.org/wiki/Byte_Order_...
Answer 2 links:
[PEP 0263]: http://www.python.org/dev/peps/pep-0263/
Answer 3 links:
[here]: https://github.com/alexmojaki/birdseye/b...
[here]: https://github.com/alexmojaki/birdseye/b...
Answer 4 links:
[PEP (0263)]: http://www.python.org/dev/peps/pep-0263/
[get_coding_spec]: http://hg.python.org/cpython/file/bf5b97...
[check_coding_spec]: http://hg.python.org/cpython/file/bf5b97...
[decoding_fgets]: http://hg.python.org/cpython/file/bf5b97...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #unicode
#avk47
ACCEPTED ANSWER
Score 7
You should be able to roll your own decoder in Python. If you're only supporting 8-bit encodings that are supersets of ASCII, the code below should work as-is.
If you need to support 2-byte encodings like UTF-16, you'd need to augment the pattern to match \x00c\x00o.. or the reverse, depending on the byte order mark.
First, generate a few test files which advertise their encoding:
import codecs, sys
for encoding in ('utf-8', 'cp1252'):
    out = codecs.open('%s.txt' % encoding, 'w', encoding)
    out.write('# coding = %s\n' % encoding)
    out.write(u'\u201chello se\u00f1nor\u201d')
    out.close()
Then write the decoder:
import codecs, re
def open_detect(path):
    fin = open(path, 'rb')
    prefix = fin.read(80)
    encs = re.findall(r'#\s*coding\s*=\s*([\w\-]+)\s+', prefix)
    encoding = encs[0] if encs else 'utf-8'
    fin.seek(0)
    return codecs.EncodedFile(fin, 'utf-8', encoding)
for path in ('utf-8.txt','cp1252.txt'):
    fin = open_detect(path)
    print repr(fin.readlines())
Output:
['# coding = utf-8\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
['# coding = cp1252\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
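On Python 3, the same sniff-and-rewind idea can be sketched with io.TextIOWrapper instead of codecs.EncodedFile (the helper name mirrors the one above, and the 80-byte sniff window and the "coding =" comment style are assumptions carried over from the answer's test files):

```python
import io
import re

def open_detect(path):
    # Sniff the declared encoding from the first bytes of the file,
    # then rewind and wrap the binary stream with a text decoder.
    fin = open(path, 'rb')
    prefix = fin.read(80)
    # Matches the '# coding = name' style used above, not the full
    # PEP 0263 pattern.
    encs = re.findall(rb'#\s*coding\s*=\s*([-\w]+)', prefix)
    encoding = encs[0].decode('ascii') if encs else 'utf-8'
    fin.seek(0)
    return io.TextIOWrapper(fin, encoding=encoding)
```

Calling `open_detect('cp1252.txt').read()` then yields already-decoded text rather than transcoded bytes.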
ANSWER 2
Score 3
I examined the sources of tokenizer.c (thanks to @Ninefingers for suggesting this in another answer and giving a link to the source browser). It seems that the exact algorithm used by Python is (equivalent to) the following. In various places I'll describe the algorithm as reading byte by byte---obviously one wants to do something buffered in practice, but it's easier to describe this way. The initial part of the file is processed as follows:
- Upon opening a file, attempt to recognize the UTF-8 BOM at the beginning of the file. If you see it, eat it and make a note of the fact that you saw it. Do not recognize the UTF-16 byte order mark.
- Read 'a line' of text from the file. 'A line' is defined as follows: you keep reading bytes until you see one of the strings '\n', '\r' or '\r\n' (matching as long a string as possible, which means that if you see '\r' you have to speculatively read the next byte and, if it's not a '\n', put it back). The terminator is included in the line, as is usual Python practice.
- Decode this string using the UTF-8 codec. Unless you have seen the UTF-8 BOM, generate an error message if you see any non-ASCII characters (i.e. any characters above 127). (Python 3.0 does not, of course, generate an error here.) Pass this decoded line on to the user for processing.
- Attempt to interpret this line as a comment containing a coding declaration, using the regexp in PEP 0263. If you find a coding declaration, skip to the instructions below for 'I found a coding declaration'.
- OK, so you didn't find a coding declaration. Read another line from the input, using the same rules as in step 2 above.
- Decode it, using the same rules as step 3, and pass it on to the user for processing.
- Attempt again to interpret this line as a coding-declaration comment, as in step 4. If you find one, skip to the instructions below for 'I found a coding declaration'.
- OK. We've now checked the first two lines. According to PEP 0263, if there was going to be a coding declaration, it would have been on one of the first two lines, so we now know we're not going to see one. We now read the rest of the file using the same reading instructions as we used for the first two lines: read lines using the rules in step 2, decode using the rules in step 3 (raising an error on non-ASCII bytes unless we saw a BOM).
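The first-two-lines check described above can be sketched as follows. The helper name and regexp are mine (adapted from PEP 0263), and this sketch deliberately skips the rule that a second-line declaration only counts when the first line is blank or a comment:

```python
import re

# Simplified PEP 0263 coding-declaration pattern.
CODING_RE = re.compile(rb'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

def sniff_source_encoding(source_bytes):
    # Step 1: a UTF-8 BOM fixes the encoding outright.
    if source_bytes.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    # Steps 2-7: only the first two lines may carry a declaration.
    for line in source_bytes.splitlines(True)[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode('ascii')
    return 'utf-8'  # Python 3's default when nothing is declared
```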
 
Now the rules for what to do when 'I found a coding declaration':
- If we previously saw a UTF-8 BOM, check that the coding declaration says 'utf-8' in some form; throw an error otherwise. ('utf-8' in some form means anything which, after converting to lower case and converting underscores to hyphens, is either the literal string 'utf-8' or something beginning with 'utf-8-'.)
- Read the rest of the file using the decoder associated with the given encoding in the Python codecs module. In particular, the division of the rest of the bytes in the file into lines is the job of the new encoding.
- One final wrinkle: universal newlines. The rules here are as follows. If the encoding is anything except 'utf-8' in some form or 'latin-1' in some form, do no universal-newline handling at all; just pass out lines exactly as they come from the decoder in the codecs module. On the other hand, if the encoding is 'utf-8' in some form or 'latin-1' in some form, transform lines ending '\r' or '\r\n' into lines ending '\n'. ('utf-8' in some form means the same as before. 'latin-1' in some form means anything which, after converting to lower case and converting underscores to hyphens, is one of the literal strings 'latin-1', 'iso-latin-1' or 'iso-8859-1', or any string beginning with 'latin-1-', 'iso-latin-1-' or 'iso-8859-1-'.)
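The encoding-name normalization used in these rules can be sketched like this (the helper names are my own):

```python
def _normalize(name):
    # Lower-case and map underscores to hyphens, as described above.
    return name.lower().replace('_', '-')

def is_utf8_like(name):
    # 'utf-8' in some form: exactly 'utf-8', or an extension of it.
    n = _normalize(name)
    return n == 'utf-8' or n.startswith('utf-8-')

def is_latin1_like(name):
    # 'latin-1' in some form: any of the three base spellings,
    # or an extension of one of them.
    n = _normalize(name)
    return any(n == base or n.startswith(base + '-')
               for base in ('latin-1', 'iso-latin-1', 'iso-8859-1'))
```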
For what I'm doing, fidelity to Python's behaviour is important. My plan is to implement the algorithm above in Python and use that. Thanks to everyone who answered!
ANSWER 3
Score 2
From said PEP (0263):
Python's tokenizer/compiler combo will need to be updated to work as follows:
- read the file
- decode it into Unicode assuming a fixed per-file encoding
- convert it into a UTF-8 byte string
- tokenize the UTF-8 content
- compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first reencoding the UTF-8 data into 8-bit string data using the given file encoding
Indeed, if you check Parser/tokenizer.c in the Python source you'll find functions get_coding_spec and check_coding_spec which are responsible for finding this information on a line being examined in decoding_fgets.
It doesn't look like this capability is exposed anywhere as a Python API (at least these specific functions aren't Py-prefixed), so your options are a third-party library and/or re-purposing these functions as an extension. I don't personally know of any third-party libraries, and I can't see this functionality in the standard library either.
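As a side note, later Python 3 releases do expose this logic at the Python level through tokenize.detect_encoding (and tokenize.open), which apply the BOM and PEP 0263 checks for you. The cp1252-declared source bytes below are my own example:

```python
import io
import tokenize

source = b"# -*- coding: cp1252 -*-\nname = '\x93ok\x94'\n"

# detect_encoding reads at most two lines from the readline callable
# and returns the declared (or default) encoding plus the raw lines
# it consumed.
encoding, lines = tokenize.detect_encoding(io.BytesIO(source).readline)
```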
ANSWER 4
Score 1
Starting from Python 3.4 there is a function that does what you're asking for: importlib.util.decode_source.
According to documentation:
importlib.util.decode_source(source_bytes)
Decode the given bytes representing source code and return it as a string with universal newlines (as required by importlib.abc.InspectLoader.get_source()).
Brett Cannon talks about this function in his talk From Source to Code: How CPython's Compiler Works.
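A minimal sketch of its behaviour, using cp1252-declared source bytes with Windows line endings as an example of my own:

```python
import importlib.util

source = b"# -*- coding: cp1252 -*-\r\nname = '\x93se\xf1or\x94'\r\n"

# decode_source applies the BOM/coding-declaration rules and
# normalizes line endings to '\n' (universal newlines).
text = importlib.util.decode_source(source)
```

The cp1252 bytes \x93, \xf1 and \x94 come back as the curly quotes and n-tilde characters, and the '\r\n' endings are folded to '\n'.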