Check whether a Python str contains Chinese characters

Full question
https://stackoverflow.com/questions/1944...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python

ACCEPTED ANSWER

Score 5

Check the Unicode range of each character to find out whether a character in the string is a Chinese character. A Google search tells me that all Chinese characters fall between u'\u4e00' and u'\u9fff'. You may want to verify that yourself.

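def check_contain_chinese(check_str):
    # Accept UTF-8 encoded bytes as well as text (a Python 3 str has no .decode()).
    if isinstance(check_str, bytes):
        check_str = check_str.decode('utf-8')
    for ch in check_str:
        # U+4E00..U+9FFF is the CJK Unified Ideographs block.
        if u'\u4e00' <= ch <= u'\u9fff':
            return True
    return False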

ANSWER 2

Score 3

All existing answers here confuse CJK characters (which cover Chinese, Japanese, and Korean) with Han characters (which represent only Chinese).

It's easy to tell whether a character is CJK, but harder to tell whether a character is Chinese. The standard also keeps changing, and new characters are always being added.

But in practice, people usually use the range u'\u4e00' - u'\u9fa5' to check whether a character is Chinese. CJK characters outside that range usually cannot be displayed by common Chinese fonts.
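The narrower range mentioned above could be sketched in Python 3 roughly as follows. The function names `is_common_chinese_char` and `contains_common_chinese` are made up for this illustration, and the U+4E00 to U+9FA5 bounds are simply the convention quoted in this answer, not a definition of "Chinese".

```python
def is_common_chinese_char(ch):
    """Return True if the single character ch lies in U+4E00..U+9FA5.

    The bounds are the practical convention quoted in the answer, not an
    official definition of a Chinese character.
    """
    return len(ch) == 1 and '\u4e00' <= ch <= '\u9fa5'


def contains_common_chinese(text):
    """Return True if any character of text lies in that range."""
    return any(is_common_chinese_char(ch) for ch in text)
```

For example, `contains_common_chinese('xx中国')` is true and `contains_common_chinese('xxx')` is false.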

Sometimes characters from CJK Radicals Supplement, Bopomofo, and CJK Strokes should also be treated as Chinese characters. They are not in the CJK Unified Ideographs block (u'\u4e00' - u'\u9fff'), but they are common and important in the Chinese writing system.
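If those extra blocks matter for your use case, one possible approach is to test each code point against a small table of ranges. This is only a sketch: the names `CHINESE_RELATED_BLOCKS`, `in_chinese_related_block`, and `contains_chinese_related` are hypothetical, and the table lists just the blocks named in this answer, so it is not a complete inventory of Chinese-related Unicode blocks.

```python
# Inclusive code point ranges for the Unicode blocks named in the answer.
# This selection is an assumption for illustration, not an exhaustive list.
CHINESE_RELATED_BLOCKS = (
    (0x2E80, 0x2EFF),  # CJK Radicals Supplement
    (0x3100, 0x312F),  # Bopomofo
    (0x31C0, 0x31EF),  # CJK Strokes
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
)


def in_chinese_related_block(ch):
    """Return True if the single character ch falls in one of the blocks."""
    code_point = ord(ch)
    return any(low <= code_point <= high
               for low, high in CHINESE_RELATED_BLOCKS)


def contains_chinese_related(text):
    """Return True if any character of text falls in one of the blocks."""
    return any(in_chinese_related_block(ch) for ch in text)
```

Keeping the ranges in one table makes it easy to add or drop blocks as requirements change, which fits the point that the standard keeps growing.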

Reference:

CJK characters

CJK Unified Ideographs

Unihan Database Lookup

GB 2312 to Unicode

GB 12345 to Unicode

ANSWER 3

Score 2

There are six Unicode maps for Chinese characters. Just check whether the code point of any character in your string falls within the 0x4E00 - 0x9FFF interval:

>>> any(0x4E00 <= ord(x) <= 0x9FFF for x in u'xx中国')
True
>>> any(0x4E00 <= ord(x) <= 0x9FFF for x in u'xxx')
False