How to determine the encoding of text
--
Music by Eric Matyas
https://www.soundimage.org
Track title: Horror Game Menu Looping
--
Chapters
00:00 Question
00:30 Accepted answer (Score 268)
02:51 Answer 2 (Score 92)
03:47 Answer 3 (Score 45)
04:27 Answer 4 (Score 38)
05:12 Thank you
--
Full question
https://stackoverflow.com/questions/4362...
Question links:
[How can I detect the encoding/codepage of a text file]: https://stackoverflow.com/questions/9083...
Accepted answer links:
https://pypi.org/project/charset-normali.../
[chardet]: http://pypi.python.org/pypi/chardet
[UnicodeDammit]: https://www.crummy.com/software/Beautifu...
[chardet]: http://pypi.python.org/pypi/chardet
Answer 2 links:
[libmagic]: http://linux.die.net/man/3/libmagic
[file]: http://linux.die.net/man/1/file
[python-magic]: http://packages.debian.org/search?keywor...
[python3-magic]: http://packages.debian.org/search?keywor...
[python-magic]: https://pypi.python.org/pypi/python-magi...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #encoding #textfiles
#avk47
ACCEPTED ANSWER
Score 288
EDIT: chardet seems to be unmantained but most of the answer applies. Check https://pypi.org/project/charset-normalizer/ for an alternative
Correctly detecting the encoding all times is impossible.
(From chardet FAQ:)
However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
There is the chardet library that uses that study to try to detect encoding. chardet is a port of the auto-detection code in Mozilla.
You can also use UnicodeDammit. It will try the following methods:
- An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
- An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.
- UTF-8
- Windows-1252
ANSWER 2
Score 101
Another option for working out the encoding is to use libmagic (which is the code behind the file command). There are a profusion of python bindings available.
The python bindings that live in the file source tree are available as the python-magic (or python3-magic) debian package. It can determine the encoding of a file by doing:
import magic
blob = open('unknown-file', 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob) # "utf-8" "us-ascii" etc
There is an identically named, but incompatible, python-magic pip package on pypi that also uses libmagic. It can also get the encoding, by doing:
import magic
blob = open('unknown-file', 'rb').read()
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)
ANSWER 3
Score 47
Some encoding strategies, please uncomment to taste :
#!/bin/bash
#
tmpfile=$1
echo '-- info about file file ........'
file -i $tmpfile
enca -g $tmpfile
echo 'recoding ........'
#iconv -f iso-8859-2 -t utf-8 back_test.xml > $tmpfile
#enca -x utf-8 $tmpfile
#enca -g $tmpfile
recode CP1250..UTF-8 $tmpfile
You might like to check the encoding by opening and reading the file in a form of a loop... but you might need to check the filesize first :
# PYTHON
encodings = ['utf-8', 'windows-1250', 'windows-1252'] # add more
for e in encodings:
try:
fh = codecs.open('file.txt', 'r', encoding=e)
fh.readlines()
fh.seek(0)
except UnicodeDecodeError:
print('got unicode error with %s , trying different encoding' % e)
else:
print('opening the file with encoding: %s ' % e)
break
ANSWER 4
Score 41
Here is an example of reading and taking at face value a chardet encoding prediction, reading n_lines from the file in the event it is large.
chardet also gives you a probability (i.e. confidence) of it's encoding prediction (haven't looked how they come up with that), which is returned with its prediction from chardet.predict(), so you could work that in somehow if you like.
import chardet
from pathlib import Path
def predict_encoding(file_path: Path, n_lines: int=20) -> str:
'''Predict a file's encoding using chardet'''
# Open the file as binary data
with Path(file_path).open('rb') as f:
# Join binary lines for specified number of lines
rawdata = b''.join([f.readline() for _ in range(n_lines)])
return chardet.detect(rawdata)['encoding']