
Developer question - detecting hanzi in unicode string


11 replies to this topic

#1 share westmeadboy

westmeadboy
  • Members
  • 103 posts
  • Location:China

Posted 08 September 2009 - 02:16 PM

Hello fellow developers.

If I have a unicode string, how can I detect if a character is Hanzi or not?

#2 share imron

imron

    Admin

  • Administrators
  • 10,900 posts
  • Location:国外

Posted 08 September 2009 - 02:27 PM

Simply compare the codepoint to see if it falls within the relevant character range. All CJK characters fall within a number of contiguous ranges. See this page for specific values of each range.
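A minimal Python 3 sketch of that range check, for illustration only. The block boundaries are the ones published in the Unicode code charts; the names `HANZI_RANGES`, `is_hanzi` and `contains_hanzi` are made up for this example, and which blocks to include is a choice you would adjust.

```python
# Inclusive code point ranges for three Unicode blocks.
# HANZI_RANGES, is_hanzi and contains_hanzi are illustrative names only.
HANZI_RANGES = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
)


def is_hanzi(ch):
    """Return True if the single character ch lies in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HANZI_RANGES)


def contains_hanzi(text):
    """Return True if at least one character of text is a hanzi."""
    return any(is_hanzi(ch) for ch in text)
```

Python 3 strings are sequences of code points, so `ord()` works directly even for characters outside the BMP.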

#3 share westmeadboy

westmeadboy
  • Members
  • 103 posts
  • Location:China

Posted 08 September 2009 - 02:48 PM

@imron - thanks for that.

I'm still rather confused about the whole thing. Is it enough to take into account these ranges:

Unified CJK Ideographs
CJK Ideographs Ext. A
CJK Ideographs Ext. B

I just want to detect traditional and simplified chars that appear in real life (in chinese).

#4 share imron

imron

    Admin

  • Administrators
  • 10,900 posts
  • Location:国外

Posted 08 September 2009 - 02:59 PM

I would say yes. In fact for most characters, just the unified CJK Ideographs would cover it (extension A and B are mostly more obscure, rarely used characters - but definitely still wanted for absolute completeness). For things like the radicals and strokes, similar characters for these already exist in the main CJK Ideographs range and from my experiments with a limited number of radicals like 氵, the actual unicode output by IMEs is for the character in the main range, and not the separate radical ranges.

You may also want to click the "symbols and punctuation" link and include CJK punctuation.
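To make the radical and punctuation points concrete, here is a small Python 3 sketch. It assumes the block boundaries from the Unicode code charts; `in_block` and the constant names are hypothetical helpers, not part of any library. It checks that 氵 sits in the main ideograph block, that a character from the separate Kangxi Radicals block can be folded onto its unified counterpart with compatibility normalization, and that CJK punctuation lives in its own block.

```python
import unicodedata

# Inclusive block boundaries (illustrative constant names).
CJK_UNIFIED = (0x4E00, 0x9FFF)
KANGXI_RADICALS = (0x2F00, 0x2FDF)
CJK_SYMBOLS_AND_PUNCTUATION = (0x3000, 0x303F)


def in_block(ch, block):
    """Return True if character ch falls inside the inclusive block range."""
    lo, hi = block
    return lo <= ord(ch) <= hi


# U+6C35 is the three-dot water radical as a regular unified ideograph.
water = '\u6C35'

# U+2F00 KANGXI RADICAL ONE belongs to a separate radical block, but
# NFKC normalization maps it to the unified character U+4E00.
kangxi_one = '\u2F00'
folded = unicodedata.normalize('NFKC', kangxi_one)

# U+3002 IDEOGRAPHIC FULL STOP is CJK punctuation, not an ideograph.
full_stop = '\u3002'
```

If radical-block characters ever do reach your search box (for example by copy-paste), normalizing the query first is one way to avoid adding those blocks to the range check.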

#5 share westmeadboy

westmeadboy
  • Members
  • 103 posts
  • Location:China

Posted 08 September 2009 - 03:30 PM

I should probably add that I want to do this as part of the dictionary app (I mentioned in another thread) at the point where the user enters some kind of search term. I want the app to automatically detect whether the user has entered hanzi, pinyin or english.

Maybe eventually I'll allow hanzi and pinyin to be mixed together - but this is probably for a later version...

Anyway, so I may as well check more ranges rather than less, because speed is not important in this part of the execution.
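A rough Python 3 sketch of that kind of input classification. Everything here is a heuristic assumed for illustration: `classify_query` is a hypothetical name, pinyin is recognised only when it carries tone marks or tone numbers, and unmarked pinyin such as "ni hao" cannot be told apart from English this way. Accented words from other languages would also be misread as pinyin.

```python
import re
import unicodedata

# Main block plus Extensions A and B.
HANZI = re.compile('[\u3400-\u4DBF\u4E00-\u9FFF\U00020000-\U0002A6DF]')

# Combining macron, acute, caron and grave: the four pinyin tone marks
# as they appear after canonical decomposition (NFD).
TONE_MARKS = {'\u0304', '\u0301', '\u030C', '\u0300'}

# Numbered pinyin such as "ni3 hao3": letters followed by a tone digit 1-5.
TONE_NUMBERED = re.compile(r'^\s*(?:[a-z\u00FC:]+[1-5]\s*)+$')


def classify_query(text):
    """Guess whether a search term is 'hanzi', 'pinyin' or 'english'."""
    if HANZI.search(text):
        return 'hanzi'
    decomposed = unicodedata.normalize('NFD', text)
    if any(ch in TONE_MARKS for ch in decomposed):
        return 'pinyin'
    if TONE_NUMBERED.match(text.lower()):
        return 'pinyin'
    return 'english'
```

Checking for hanzi first means mixed hanzi and pinyin input is treated as hanzi, which would need revisiting if mixed queries are supported later.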

#6 share imron

imron

    Admin

  • Administrators
  • 10,900 posts
  • Location:国外

Posted 08 September 2009 - 03:44 PM

Fair enough. For reference, the Unified CJK Ideographs block covers almost 21,000 characters and will handle almost anything your users will input. CJK Extension A covers a further 6,500. Both of these ranges are in the BMP (Basic Multilingual Plane). CJK Extension B contains a further 43,000 characters and is in the SIP (Supplementary Ideographic Plane), so you'll encounter problems if you're using buggy UTF-16 code that doesn't realise that codepoints in the SIP are represented by 4 bytes instead of the usual 2.
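The 2-byte versus 4-byte difference can be seen with a short Python 3 sketch. The helper name `utf16_units` is invented for this example; it just splits the UTF-16 encoding of a character into 16-bit code units.

```python
bmp_char = '\u4E00'      # first code point of CJK Unified Ideographs (BMP)
sip_char = '\U00020000'  # first code point of CJK Extension B (SIP)


def utf16_units(ch):
    """Return the UTF-16 code units of ch as a list of integers."""
    data = ch.encode('utf-16-be')
    return [int.from_bytes(data[i:i + 2], 'big')
            for i in range(0, len(data), 2)]


bmp_units = utf16_units(bmp_char)  # one unit: the code point itself
sip_units = utf16_units(sip_char)  # two units: a surrogate pair
```

Code that walks a UTF-16 string one 16-bit unit at a time will see the two surrogates of a SIP character separately, and neither of them falls inside any ideograph range.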

#7 share c_redman

c_redman
  • Members
  • 227 posts
  • Location:North Carolina

Posted 08 September 2009 - 10:20 PM

I just worked on a project that did this yesterday. :clap

Python:
# http://en.wikipedia.org/wiki/CJK_Unified_Ideographs
# Each value is a range string meant for use inside a regex character class.
cjkUnifiedIdeographs = u'\u4E00-\u9FFF'
cjkCompatibilityIdeographs = u'\uF900-\uFAFF'
cjkUnifiedIdeographsExtA = u'\u3400-\u4DBF'
cjkUnifiedIdeographsExtB = u'\U00020000-\U0002A6DF'  # code points above U+FFFF need the 8-digit \U escape; not using anyway
cjkEnclosedLettersAndMonths = u'\u3200-\u32FF'


# Non-CJK characters used in simplified/traditional field of CC-CEDICT
# I added these in by trial and error
# Some of these are covered in code range "Halfwidth and Fullwidth Forms". But this makes a stricter filter
cjkMiddleDot = u'\u30FB'
cjkFullwidthComma = u'\uFF0C'
cjkLingZero = u'\u3007'
cjkFullwidthLatin = u'\uFF21-\uFF3A'

# Character class matching any single character from the ranges in use
cjkRegexp = u'[%s%s%s%s%s%s%s]' % (cjkMiddleDot, cjkFullwidthComma, cjkLingZero, cjkUnifiedIdeographsExtA, cjkUnifiedIdeographs, cjkCompatibilityIdeographs, cjkFullwidthLatin)

Every simplified/traditional entry in CC-CEDICT, except for a few rare Unicode variants, is covered by
  • cjkMiddleDot
  • cjkFullwidthComma
  • cjkLingZero
  • cjkUnifiedIdeographs
  • cjkCompatibilityIdeographs (for just a few variants)
  • cjkUnifiedIdeographsExtA (for just a few variants)
  • cjkFullwidthLatin
Extension B and Enclosed Letters and Months (these are encircled numbers and characters) are not used at all.
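For anyone on Python 3, the character class above could be compiled and applied roughly as follows. This is a sketch under the same range assumptions as the snippet above; `CJK_CLASS`, `CJK_ONLY` and `is_cjk_only` are hypothetical names chosen for the example.

```python
import re

# The same ranges as the class above, concatenated into one string.
CJK_CLASS = (
    '\u30FB'         # katakana middle dot
    '\uFF0C'         # fullwidth comma
    '\u3007'         # ling (ideographic number zero)
    '\u3400-\u4DBF'  # CJK Unified Ideographs Extension A
    '\u4E00-\u9FFF'  # CJK Unified Ideographs
    '\uF900-\uFAFF'  # CJK Compatibility Ideographs
    '\uFF21-\uFF3A'  # fullwidth Latin capital letters
)
CJK_ONLY = re.compile('[' + CJK_CLASS + ']+')


def is_cjk_only(text):
    """Return True if text is non-empty and every character is in the class."""
    return CJK_ONLY.fullmatch(text) is not None
```

Using `search` instead of `fullmatch` would answer the looser question of whether the text contains any such character at all.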

Perl also has a module that aliases code ranges; for example:

use charnames ':full';    #use friendly 'InCJKUnifiedIdeographs' for Chinese pattern match

$cjkMiddleDot = '\x{30FB}';
$cjkFullwidthComma = '\x{FF0C}';
$cjkLingZero = '\x{3007}';
$cjkFullwidthLatin = '\x{FF21}-\x{FF3A}';

# Single-quoted strings keep the backslashes literal, so the regex engine
# sees the \p{...} and \x{...} escapes when the class is interpolated.
$CJK_regexp = '[' . join('',
    '\p{InCJKUnifiedIdeographs}',
    '\p{InCJKUnifiedIdeographsExtensionA}',
    '\p{InCJKCompatibilityIdeographs}',
    $cjkMiddleDot,
    $cjkFullwidthComma,
    $cjkLingZero,
    $cjkFullwidthLatin
    ) . ']';

# result: [\p{InCJKUnifiedIdeographs}\p{InCJKUnifiedIdeographsExtensionA}\p{InCJKCompatibilityIdeographs}\x{30FB}\x{FF0C}\x{3007}\x{FF21}-\x{FF3A}]

Mix and match the ranges used, for example if you want to include Latin fullwidth letters or punctuation.

#8 share imron

imron

    Admin

  • Administrators
  • 10,900 posts
  • Location:国外

Posted 08 September 2009 - 10:46 PM

Don't suppose you remember off-hand which ones were from the compatibility ideographs?

I was under the impression that the compatibility ideographs were there to help with conversions to/from older standards but that the main and extended CJK ideograph range contained an exact same version of these characters just with a different code-point.

I wonder if it's a hang-over from when CEDICT wasn't stored as unicode and was then converted? If so, it might be worth finding the corresponding ideograph in the main ranges and submitting a patch to CEDICT. The main reason being that IMEs don't seem to output the codepoints for the duplicated compatibility characters (preferring instead to use the codepoint from the main CJK ideographs) meaning that searches for that character that come from user input would fail.
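One way to sidestep the mismatch between compatibility code points and what IMEs produce is to normalize both the dictionary headwords and the query before comparing. The Python 3 sketch below assumes this approach; `to_unified` is a hypothetical helper name. It relies on the fact that most characters in the CJK Compatibility Ideographs block carry a canonical decomposition to a unified ideograph, so standard Unicode normalization replaces them. A few code points in that block have no decomposition and are left unchanged.

```python
import unicodedata


def to_unified(text):
    """Normalize text so compatibility ideographs map to unified ones."""
    return unicodedata.normalize('NFC', text)


compat = '\uF91F'            # a CJK compatibility ideograph
unified = to_unified(compat)  # its unified counterpart in the main block

# Characters already in the main block pass through untouched.
unchanged = to_unified('\u4E2D')
```

With both sides normalized, a pasted compatibility character and an IME-typed unified character end up as the same string.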

#9 share c_redman

c_redman
  • Members
  • 227 posts
  • Location:North Carolina

Posted 08 September 2009 - 11:35 PM

This is from the most recent CC-CEDICT (2009-09-08)

CJK Compatibility Ideographs
蘭 蘭 [lan2] /Unicode compatibility variant for 蘭/orchid/
盧 盧 [lu2] /Korean variant of 盧|卢/
老 老 [lao3] /unicode compatibility variant of 老/
不 不 [bu4] /variant of 不/(negative prefix)/not/no/
練 練 [lian4] /variant of 練|练, to practice/to train/to drill/to perfect (one's skill)/exercise/
識 識 [shi2] /Unicode compatibility variant of 識|识/
兀 兀 [wu4] /duplicate of Big Five A461/

Almost all of these were in CEDICT at least at the time it was imported into CC-CEDICT, but so were the entries for the corresponding canonical character. So it wasn't a conversion error, rather just extra entries with marginal usefulness. Even if they can't be entered from an IME, they can still be copy-pasted, so they're not completely inaccessible, if someone ever encountered the character and needed to look it up.

CJK Unified Ideographs Extension A
㑇 㑇 [zhou4] /beautiful/
㑳 㑳 [zhou4] /beautiful/
㗂 㗂{u+35c2} [sheng3] /variant of 省/tight-lipped/to examine/to watch/to scour (esp. Cantonese)/
㝵 㝵 [de2] /to obtain/archaic variant of 得|得[de2]/component in 礙|碍[ai4] and 鍀|锝[de2]/
㥁 㥁{u+3941} [de2] /variant of 德, ethics/
㨗 㨗{u+3a17} [jie2] /variant of 捷/quick/nimble/
㬎 㬎 [xian3] /old variant of 顯|显[xian3]/visible/apparent/
㶸 㶸{u+3db8} [xie2] /(precise meaning unknown, relates to iron)/variant of 劦 or of 協|协/
㺵 㺵{u+3eb5} [jiu2] /black jade/variant of 玖/
䯝 䯝{u+4bdd} [sui3] /variant of 髓/marrow/essence/quintessence/pith (soft interior of plant stem)/
喎僻不遂 㖞僻不遂 [wai1 pi4 bu4 sui2] /facial paralysis and hemiplegia after apoplexy (idiom)/
奕訢 奕䜣 [Yi4 xin1] /Grand Prince Yixin (1833-1898), sixth son of Emperor Daoguang, prominent politician, diplomat and modernizer in late Qing/
恭親王奕訢 恭亲王奕䜣 [Gong1 qin1 wang2 Yi4 xin1] /Grand Prince Yixin (1833-1898), sixth son of Emperor Daoguang, prominent politician, diplomat and modernizer in late Qing/
綵 䌽{u+433d} [cai3] /variant of 彩/(bright) color/variety/multicolored silk/motley/variegated/
訢 䜣 [xin1] /pleased/delighted/happy/variant of 欣/

These are more recent entries, and appear to be from user submissions. I would guess they are from classical texts or fanciful writing.

#10 share HarryCallahan

HarryCallahan
  • Members
  • 55 posts
  • Location:Australia

Posted 09 September 2009 - 02:51 PM

How about...

- parse the CEDICT dictionary and build list of Chinese characters contained within. (see my post http://www.chinese-f...-own-char-lists)
- store sorted instance of list in program
- search list for character, return true/false.
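The three steps above might look like this in Python 3. This is only a sketch: the two sample lines are entries quoted earlier in this thread, the parsing assumes the usual CC-CEDICT line layout (traditional, simplified, then pinyin in square brackets, with `#` starting comment lines), and `headword_chars` and `in_sorted` are names invented for the example.

```python
import bisect

# Two entries quoted earlier in the thread, plus a comment line.
SAMPLE_LINES = [
    '# comment lines start with a hash',
    '訢 䜣 [xin1] /pleased/delighted/happy/variant of 欣/',
    '㬎 㬎 [xian3] /old variant of 顯|显[xian3]/visible/apparent/',
]


def headword_chars(lines):
    """Collect every character from the headword fields into a sorted list."""
    chars = set()
    for line in lines:
        if not line.strip() or line.startswith('#'):
            continue
        # Everything before the first '[' is the traditional/simplified pair.
        headwords = line.split('[', 1)[0]
        chars.update(ch for ch in headwords if not ch.isspace())
    return sorted(chars)


def in_sorted(chars, ch):
    """Binary search for ch in the sorted list chars."""
    i = bisect.bisect_left(chars, ch)
    return i < len(chars) and chars[i] == ch


known = headword_chars(SAMPLE_LINES)
```

Note that headwords in the real file also contain a few non-hanzi characters (Latin letters, the middle dot), so a list built this way needs filtering if it is meant to hold hanzi only.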

I would actually prefer this over any other implementation. Chinese characters are those characters used in written Chinese, as opposed to just 'Asian' characters.

Yeah you could do a range comparison, but it doesn't look to me that Chinese occupy one large contiguous block. (there's gaps in the codes)

I've attached a list I've just generated. That came from CEDICT and the large sample sentence file referenced in my post above. Doing both gets me another 2 characters over CEDICT alone. Total of 12,402 characters. Note this is with the default Hanzi regex filter on, there seems to be a few odd ones that come before 一, such as 䜣, 㑇, 㑳. Very odd, depends how perfect you want it to be. How many Chinese people know more than 10,000 characters? Also Japanese is contained in the CEDICT dict.


#11 share imron

imron

    Admin

  • Administrators
  • 10,900 posts
  • Location:国外

Posted 09 September 2009 - 03:05 PM

HarryCallahan wrote: "but it doesn't look to me that Chinese occupy one large contiguous block."

They don't. As mentioned in my first post:

"All CJK characters fall within a number of contiguous ranges"

i.e. there are several different contiguous ranges and you need to check them all if you want to determine if the character is Chinese. These ranges cover all ideographs common to Chinese, Japanese and Korean (CJK).

The reason checking these ranges is preferable to the method you propose is:
A) it's significantly faster and uses significantly less memory to check a few ranges than it does to search against a list of known characters (important characteristics for a mobile application).
B) it doesn't limit you to an incomplete set of characters, but rather allows for any Chinese character that the user can enter, even rare/uncommon ones that aren't in dictionaries like CEDICT. You could choose something more complete, such as the Unihan database, but then the problems mentioned in A) become more pronounced.
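A small Python 3 sketch of the memory and coverage side of this comparison. It is an illustration under assumptions, not a benchmark: `KNOWN_CHARS` is a stand-in for a dictionary-derived list (here simply every code point of the main block), and `sys.getsizeof` reports only the container itself, not the strings stored in it, so it understates the real cost of the set. Relative lookup speed depends on the data structure used and is not measured here.

```python
import sys

# Three inclusive ranges: Extension A, main block, Extension B.
RANGES = ((0x3400, 0x4DBF), (0x4E00, 0x9FFF), (0x20000, 0x2A6DF))


def in_ranges(ch):
    """Range-based membership test."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in RANGES)


# Stand-in for a list of known characters: the whole main block.
KNOWN_CHARS = frozenset(chr(cp) for cp in range(0x4E00, 0xA000))

range_table_bytes = sys.getsizeof(RANGES)
char_set_bytes = sys.getsizeof(KNOWN_CHARS)

# A rare Extension B character passes the range test but is missing
# from the fixed list.
rare = '\U00020000'
rare_in_ranges = in_ranges(rare)
rare_in_list = rare in KNOWN_CHARS
```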

#12 share imron

imron

    Admin

  • Administrators
  • 10,900 posts
  • Location:国外

Posted 09 September 2009 - 04:14 PM

It's also worth pointing out that these ranges aren't just "Asian Characters". The Unicode Standard very clearly lays out ranges for each language. The so-called unified CJK ideographs are those that are common to CJK languages, and so any character in this range can be considered Chinese. Characters specific to a given language (hiragana, katakana for Japanese, Hangul for Korean, and even Bopomofo for Chinese) each have their own distinct ranges.
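As a sketch of how those distinct ranges can be told apart, the Python 3 snippet below labels a character by block. The boundaries come from the Unicode code charts; `SCRIPT_BLOCKS` and `script_of` are hypothetical names, and only one block per script is listed to keep the example short.

```python
# (label, first code point, last code point), all inclusive.
SCRIPT_BLOCKS = (
    ('hiragana', 0x3040, 0x309F),
    ('katakana', 0x30A0, 0x30FF),
    ('bopomofo', 0x3100, 0x312F),
    ('hanzi', 0x4E00, 0x9FFF),   # CJK Unified Ideographs
    ('hangul', 0xAC00, 0xD7AF),  # Hangul Syllables
)


def script_of(ch):
    """Return the label of the block containing ch, or 'other'."""
    cp = ord(ch)
    for name, lo, hi in SCRIPT_BLOCKS:
        if lo <= cp <= hi:
            return name
    return 'other'
```

A Japanese sentence would therefore show up as a mix of 'hanzi', 'hiragana' and 'katakana' labels, which is one cheap signal that the input is not Chinese.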