Welcome to Chinese-forums.com
Developer question - detecting hanzi in unicode string
#1
Posted 08 September 2009 - 02:16 PM
If I have a unicode string, how can I detect if a character is Hanzi or not?
#3
Posted 08 September 2009 - 02:48 PM
I'm still rather confused about the whole thing. Is it enough to take into account these ranges:
Unified CJK Ideographs
CJK Ideographs Ext. A
CJK Ideographs Ext. B
I just want to detect traditional and simplified characters that appear in real life (in Chinese).
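For reference, checking those three ranges could look roughly like the sketch below. It assumes Python 3 and single-character input; the names `HANZI_RANGES` and `is_hanzi` are made up for illustration, and the block boundaries are the ones published in the Unicode charts for these three blocks.

```python
# Hypothetical helper: the names and the choice of blocks are assumptions.
HANZI_RANGES = (
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
)

def is_hanzi(ch):
    """Return True if the single character ch lies in one of HANZI_RANGES."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HANZI_RANGES)
```

Whether these three blocks are "enough" depends on the text being processed; the check itself stays the same if more ranges are added to the tuple.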
#4
Posted 08 September 2009 - 02:59 PM
Also, you may want to click the "symbols and punctuation" link and include CJK punctuation.
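A minimal sketch of what including punctuation might mean in practice, assuming Python 3. The two blocks chosen here (CJK Symbols and Punctuation, Halfwidth and Fullwidth Forms) are my assumption about which ranges are relevant, and `is_cjk_punctuation` is a hypothetical name.

```python
# Assumed blocks; both also contain some non-punctuation characters
# (for example U+3007 and the fullwidth Latin letters).
CJK_PUNCTUATION_RANGES = (
    (0x3000, 0x303F),  # CJK Symbols and Punctuation, e.g. U+3001, U+3002
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms, e.g. U+FF0C, U+FF01
)

def is_cjk_punctuation(ch):
    """Return True if ch lies in one of the assumed punctuation blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_PUNCTUATION_RANGES)
```

Because these blocks are broader than punctuation alone, a stricter filter would list individual code points instead.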
#5
Posted 08 September 2009 - 03:30 PM
Maybe eventually I'll allow hanzi and pinyin to be mixed together - but this is probably for a later version...
Anyway, I may as well check more ranges rather than fewer, because speed is not important in this part of the execution.
#6
Posted 08 September 2009 - 03:44 PM
#7
Posted 08 September 2009 - 10:20 PM
Python:
# http://en.wikipedia.org/wiki/CJK_Unified_Ideographs
cjkUnifiedIdeographs = u'\u4E00-\u9FFF'
cjkCompatibilityIdeographs = u'\uF900-\uFAFF'
cjkUnifiedIdeographsExtA = u'\u3400-\u4DBF'
# Ext. B is outside the BMP, so it needs the 8-digit \U escape; not using it anyway
cjkUnifiedIdeographsExtB = u'\U00020000-\U0002A6DF'
cjkEnclosedLettersAndMonths = u'\u3200-\u32FF'

# Non-CJK characters used in simplified/traditional field of CC-CEDICT
# I added these in by trial and error
# Some of these are covered in code range "Halfwidth and Fullwidth Forms". But this makes a stricter filter
cjkMiddleDot = u'\u30FB'
cjkFullwidthComma = u'\uFF0C'
cjkLingZero = u'\u3007'
cjkFullwidthLatin = u'\uFF21-\uFF3A'

cjkRegexp = u'[%s%s%s%s%s%s%s]' % (cjkMiddleDot, cjkFullwidthComma, cjkLingZero,
                                   cjkUnifiedIdeographsExtA, cjkUnifiedIdeographs,
                                   cjkCompatibilityIdeographs, cjkFullwidthLatin)
Every simplified/traditional entry in CC-CEDICT, except for a few rare Unicode variants, is covered by
- cjkMiddleDot
- cjkFullwidthComma
- cjkLingZero
- cjkUnifiedIdeographs
- cjkCompatibilityIdeographs (for just a few variants)
- cjkUnifiedIdeographsExtA (for just a few variants)
- cjkFullwidthLatin
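As a rough illustration of how such a character class might be used, here is a sketch assuming Python 3 and the standard `re` module. The helper names `is_all_cjk` and `contains_cjk` are hypothetical; the ranges are the same BMP ones listed above.

```python
import re

# Same character-class pieces as listed above (BMP ranges only).
CJK_CLASS = (
    '\u30FB'         # middle dot
    '\uFF0C'         # fullwidth comma
    '\u3007'         # ling zero
    '\u3400-\u4DBF'  # CJK Unified Ideographs Extension A
    '\u4E00-\u9FFF'  # CJK Unified Ideographs
    '\uF900-\uFAFF'  # CJK Compatibility Ideographs
    '\uFF21-\uFF3A'  # fullwidth Latin capitals
)
ALL_CJK = re.compile('[' + CJK_CLASS + ']+')
ANY_CJK = re.compile('[' + CJK_CLASS + ']')

def is_all_cjk(text):
    """True if text is non-empty and every character is in the class."""
    return ALL_CJK.fullmatch(text) is not None

def contains_cjk(text):
    """True if at least one character of text is in the class."""
    return ANY_CJK.search(text) is not None
```

`is_all_cjk` suits validating a dictionary headword field, while `contains_cjk` suits deciding whether a mixed string needs Chinese handling at all.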
Perl also has a module that aliases code ranges; for example:
use charnames ':full'; #use friendly 'InCJKUnifiedIdeographs' for Chinese pattern match
$cjkMiddleDot = '\x{30FB}';
$cjkFullwidthComma = '\x{FF0C}';
$cjkLingZero = '\x{3007}';
$cjkFullwidthLatin = '\x{FF21}-\x{FF3A}';
$CJK_regexp = '[' . join('',
    '\p{InCJKUnifiedIdeographs}',
    '\p{InCJKUnifiedIdeographsExtensionA}',
    '\p{InCJKCompatibilityIdeographs}',
    $cjkMiddleDot,
    $cjkFullwidthComma,
    $cjkLingZero,
    $cjkFullwidthLatin
) . ']';
# result: [\p{InCJKUnifiedIdeographs}\p{InCJKUnifiedIdeographsExtensionA}\p{InCJKCompatibilityIdeographs}\x{30FB}\x{FF0C}\x{3007}\x{FF21}-\x{FF3A}]
Mix and match the ranges used, for example if you want to include Latin fullwidth letters or punctuation.
#8
Posted 08 September 2009 - 10:46 PM
I was under the impression that the compatibility ideographs were there to help with conversions to/from older standards, but that the main and extended CJK ideograph ranges contain exact equivalents of these characters, just at different code points.
I wonder if it's a hangover from when CEDICT wasn't stored as Unicode and was then converted? If so, it might be worth finding the corresponding ideograph in the main ranges and submitting a patch to CEDICT. The main reason is that IMEs don't seem to output the code points for the duplicated compatibility characters (preferring instead the code point from the main CJK ideographs), meaning that searches for that character that come from user input would fail.
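One possible way to find the corresponding main-range ideograph, sketched under the assumption of Python 3 and its standard `unicodedata` module: most code points in the CJK Compatibility Ideographs block carry a canonical decomposition to a unified ideograph, so NFC normalization replaces them. The function names are hypothetical.

```python
import unicodedata

def to_unified(text):
    """Normalize text to NFC, which maps compatibility ideographs that have
    a canonical decomposition onto their unified equivalents."""
    return unicodedata.normalize('NFC', text)

def is_compatibility_ideograph(ch):
    """True if ch lies in the CJK Compatibility Ideographs block."""
    return 0xF900 <= ord(ch) <= 0xFAFF

# U+F967 is the compatibility duplicate of U+4E0D.
variant = '\uF967'
canonical = to_unified(variant)
```

A few code points in that block have no decomposition and are left unchanged, and NFC also affects other characters (for example combining accents), so this is better treated as a lookup aid than as a blanket cleanup step.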
#9
Posted 08 September 2009 - 11:35 PM
CJK Compatibility Ideographs
蘭 蘭 [lan2] /Unicode compatibility variant for 蘭/orchid/
盧 盧 [lu2] /Korean variant of 盧|卢/
老 老 [lao3] /unicode compatibility variant of 老/
不 不 [bu4] /variant of 不/(negative prefix)/not/no/
練 練 [lian4] /variant of 練|练, to practice/to train/to drill/to perfect (one's skill)/exercise/
識 識 [shi2] /Unicode compatibility variant of 識|识/
兀 兀 [wu4] /duplicate of Big Five A461/
Almost all of these were in CEDICT at least at the time it was imported into CC-CEDICT, but so were the entries for the corresponding canonical character. So it wasn't a conversion error, rather just extra entries with marginal usefulness. Even if they can't be entered from an IME, they can still be copy-pasted, so they're not completely inaccessible, if someone ever encountered the character and needed to look it up.
CJK Unified Ideographs Extension A
㑇 㑇 [zhou4] /beautiful/
㑳 㑳 [zhou4] /beautiful/
㗂 㗂{u+35c2} [sheng3] /variant of 省/tight-lipped/to examine/to watch/to scour (esp. Cantonese)/
㝵 㝵 [de2] /to obtain/archaic variant of 得|得[de2]/component in 礙|碍[ai4] and 鍀|锝[de2]/
㥁 㥁{u+3941} [de2] /variant of 德, ethics/
㨗 㨗{u+3a17} [jie2] /variant of 捷/quick/nimble/
㬎 㬎 [xian3] /old variant of 顯|显[xian3]/visible/apparent/
㶸 㶸{u+3db8} [xie2] /(precise meaning unknown, relates to iron)/variant of 劦 or of 協|协/
㺵 㺵{u+3eb5} [jiu2] /black jade/variant of 玖/
䯝 䯝{u+4bdd} [sui3] /variant of 髓/marrow/essence/quintessence/pith (soft interior of plant stem)/
喎僻不遂 㖞僻不遂 [wai1 pi4 bu4 sui2] /facial paralysis and hemiplegia after apoplexy (idiom)/
奕訢 奕䜣 [Yi4 xin1] /Grand Prince Yixin (1833-1898), sixth son of Emperor Daoguang, prominent politician, diplomat and modernizer in late Qing/
恭親王奕訢 恭亲王奕䜣 [Gong1 qin1 wang2 Yi4 xin1] /Grand Prince Yixin (1833-1898), sixth son of Emperor Daoguang, prominent politician, diplomat and modernizer in late Qing/
綵 䌽{u+433d} [cai3] /variant of 彩/(bright) color/variety/multicolored silk/motley/variegated/
訢 䜣 [xin1] /pleased/delighted/happy/variant of 欣/
These are more recent entries, and appear to be from user submissions. I would guess they are from classical texts or fanciful writing.
#10
Posted 09 September 2009 - 02:51 PM
- parse the CEDICT dictionary and build list of Chinese characters contained within. (see my post http://www.chinese-f...-own-char-lists)
- store sorted instance of list in program
- search list for character, return true/false.
I would actually prefer this over any other implementation. Chinese characters are those characters used in written Chinese, as opposed to just 'Asian' characters.
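A minimal sketch of those three steps, assuming Python 3 and the usual CC-CEDICT line layout (`Traditional Simplified [pinyin] /gloss/`, with `#` comment lines). The function names and the `default_keep` filter are hypothetical, and the second sample line has its gloss abbreviated.

```python
from bisect import bisect_left

def default_keep(ch):
    """Assumed filter: main CJK block plus Extension A."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF

def build_char_list(lines, keep=default_keep):
    """Collect the characters of the headword fields into a sorted list."""
    chars = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        headwords = line.split('[', 1)[0]  # everything before the pinyin
        chars.update(ch for ch in headwords if keep(ch))
    return sorted(chars)

def in_char_list(sorted_chars, ch):
    """Binary search of the sorted list; returns True or False."""
    i = bisect_left(sorted_chars, ch)
    return i < len(sorted_chars) and sorted_chars[i] == ch

SAMPLE = [
    '# comment lines are skipped',
    '訢 䜣 [xin1] /pleased/delighted/happy/variant of 欣/',
    '奕訢 奕䜣 [Yi4 xin1] /Grand Prince Yixin (1833-1898)/',
]
CHAR_LIST = build_char_list(SAMPLE)
```

Characters that only occur in a gloss, such as 欣 in the first sample line, are not collected, since only the text before the pinyin bracket is scanned.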
Yeah, you could do a range comparison, but it doesn't look to me like Chinese characters occupy one large contiguous block (there are gaps in the codes).
I've attached a list I've just generated. It came from CEDICT and the large sample sentence file referenced in my post above. Doing both gets me another 2 characters over CEDICT alone, for a total of 12,402 characters. Note this is with the default Hanzi regex filter on; there seem to be a few odd ones that come before 一, such as 䜣, 㑇, 㑳. Very odd, but it depends how perfect you want it to be. How many Chinese people know more than 10,000 characters? Also, Japanese is contained in the CEDICT dict.
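On the characters that sort before 一: a quick way to see where they live is to map each one to its Unicode block. The sketch below assumes Python 3; `BLOCKS` and `block_of` are hypothetical names, and only three blocks are listed.

```python
# Assumed subset of blocks, enough for the characters discussed here.
BLOCKS = (
    (0x3400, 0x4DBF, 'CJK Unified Ideographs Extension A'),
    (0x4E00, 0x9FFF, 'CJK Unified Ideographs'),
    (0xF900, 0xFAFF, 'CJK Compatibility Ideographs'),
)

def block_of(ch):
    """Return the name of the block containing ch, or None if not listed."""
    cp = ord(ch)
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return None

# Map each of the "odd" characters, plus 一 itself, to its block.
odd_ones = {ch: block_of(ch) for ch in '䜣㑇㑳一'}
```

Extension A sits at U+3400 to U+4DBF, directly below the main block that starts at 一 (U+4E00), which would explain why its characters sort first in a list ordered by code point.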
#11
Posted 09 September 2009 - 03:05 PM
"but it doesn't look to me that Chinese occupy one large contiguous block."
They don't. As mentioned in my first post: "All CJK characters fall within a number of contiguous ranges", i.e. there are several different contiguous ranges and you need to check them all if you want to determine if the character is Chinese. These ranges cover all ideographs common to Chinese, Japanese and Korean (CJK).
The reason checking these ranges is preferable to the method you propose is:
A) it's significantly faster and uses significantly less memory to check a few ranges than it does to search against a list of known characters (important characteristics for a mobile application).
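To make the memory side of that comparison concrete, here is a rough sketch assuming Python 3. It uses only the main block for brevity, all names are hypothetical, and it deliberately makes no timing claim, since relative speed depends heavily on the language and data structure (a hashed set lookup in Python is itself fast).

```python
import sys

RANGES = ((0x4E00, 0x9FFF),)  # main CJK Unified Ideographs block only

def in_ranges(ch):
    """Range check: a couple of integer comparisons per range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in RANGES)

# The same 20,992 code points stored as an explicit lookup table.
CHAR_TABLE = frozenset(chr(cp) for lo, hi in RANGES for cp in range(lo, hi + 1))

def in_table(ch):
    """Table check: membership test against the stored characters."""
    return ch in CHAR_TABLE

# sys.getsizeof is shallow: it measures the container only, not the strings
# stored in it, so the real gap is larger than this comparison suggests.
ranges_bytes = sys.getsizeof(RANGES)
table_bytes = sys.getsizeof(CHAR_TABLE)
```

Both functions give the same answer for every character; the difference is that the table has to hold one entry per character, while the range tuple stays the same size however many characters the block contains.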
#12
Posted 09 September 2009 - 04:14 PM