adrianlondon Posted January 20, 2006 at 04:40 PM
I was looking for information on something else (sending SMSs in Chinese) and came across a comment on the Sony-Ericsson web site. It states that an average Chinese word contains 1.6 characters. Is that a fact quoted elsewhere or did they make it up?
yingguoguy Posted January 20, 2006 at 05:00 PM
From Asia's Orthographic Dilemma:
For running text, DeFrancis estimates Chinese "as only 30 percent monosyllabic as against 50 percent for English material written in a style comparable to that of the Chinese" (1943:235). Zheng gives a higher figure of 40 percent monosyllabicity for Chinese texts (1957:50).
DeFrancis: (1 * 0.3) + (2 * 0.7) = 1.7
Zheng: (1 * 0.4) + (2 * 0.6) = 1.6
I've assumed that anything that isn't a 1-syllable word is a 2-syllable word, as words of 3+ syllables only make up 2% and I don't want the hassle.
Seems about right, but this is for running text (i.e. counting the lengths of words in a normal text), not the total number of words in the dictionary.
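The back-of-envelope arithmetic above can be sketched in Python. The function name `mean_word_length` is made up for illustration, and it bakes in the same simplification as the post: every word that is not monosyllabic is treated as exactly two syllables.

```python
def mean_word_length(mono_share: float) -> float:
    """Average syllables per word in running text.

    Assumption (same as the post above): every non-monosyllabic word
    has exactly 2 syllables; words of 3+ syllables are ignored.
    """
    if not 0.0 <= mono_share <= 1.0:
        raise ValueError("mono_share must be between 0 and 1")
    return 1 * mono_share + 2 * (1 - mono_share)


# Shares of monosyllabic words quoted in the post
estimates = {"DeFrancis": 0.30, "Zheng": 0.40}
for source, share in estimates.items():
    print(f"{source}: {mean_word_length(share):.1f} syllables per word")
```

Counting the roughly 2% of longer words separately would push both figures up slightly, so these are best read as lower-end estimates.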
Ferno Posted January 20, 2006 at 07:33 PM
Hmm, pretty surprising for written text... Does the percentage of words with two or more syllables go up even more in the spoken language because of homonyms?
Lugubert Posted January 22, 2006 at 12:51 AM
Any page of my 汉英词典 has a majority of two-syllable "words". I'm not surprised by the 1.6-1.7 syllables (or characters?!) per word.
One thing that I would love to investigate for a thesis at university third-semester level is those claims of "you need to know n characters or words to understand x% of a text in Chinese". I'm curious about how the claimants measure "understanding", their sample sizes, the selection of test subjects, methods used, definitions of a "word" etc. Links/comments, please!
roddy Posted January 22, 2006 at 03:32 AM
I have some stats I extracted from the HSK lists some time ago that might help. I wouldn't want to treat these as gospel, as there are issues. One is obviously the limited sample size. Also, some entries have punctuation in them (e.g. "除了。。。以外"), and I'm not sure what effect that had. Also, I didn't bother looking for any entries above four syllables / characters. Can't remember offhand if there are any. So with that caveat...
The 8785 words I looked at (should be 8821, but for the reasons above not all were picked up) break down like this. Note that I was thinking in terms of syllables at the time, not characters, but it's the same thing.
4 syllables: 186 words (744 characters)
3 syllables: 293 words (879 characters)
2 syllables: 6384 words (12768 characters)
1 syllable: 1922 words (1922 characters)
This gives us a total of 16313 characters in 8785 words, making about 1.86 characters per word.
Roddy
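For anyone who wants to re-run the arithmetic, here is a minimal Python sketch that uses only the four word counts given in the post. The variable names are illustrative, and it assumes one character per syllable, as the post does.

```python
# Word counts by length, taken from the post above
# (key = syllables per word, assumed equal to characters per word)
counts = {1: 1922, 2: 6384, 3: 293, 4: 186}

total_words = sum(counts.values())
total_chars = sum(length * n for length, n in counts.items())
average = total_chars / total_words

print(f"{total_chars} characters in {total_words} words")
print(f"about {average:.2f} characters per word")
```

Note that this weights every list entry equally, so it describes the word list itself, not running text.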
wushijiao Posted January 22, 2006 at 04:33 AM
I wonder if the Sony Ericsson people were counting by frequency. A lot of less frequently used words are two, three, or even four characters long. On the other hand, a lot of single-character words (like 我, 你, 在) are used more often, which would bring the number down towards 1.6. I don't know.
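The difference between counting each dictionary entry once and counting by frequency in running text can be illustrated with a small Python sketch. The token list below is a made-up toy example, not real corpus data, so the numbers only show the direction of the effect.

```python
from collections import Counter

# Toy, made-up segmented text (hypothetical; not real corpus data)
tokens = ["我", "在", "学校", "学习", "我", "在", "家", "学习", "社会主义", "我"]

freq = Counter(tokens)

# Per distinct word: each entry counts once, as in a word list
type_avg = sum(len(word) for word in freq) / len(freq)

# Per occurrence: frequent short words pull the average down
token_avg = sum(len(word) * n for word, n in freq.items()) / sum(freq.values())

print(f"per distinct word: {type_avg:.2f} characters")
print(f"per running word:  {token_avg:.2f} characters")
```

In this toy sample the frequency-weighted average comes out lower because the single-character words repeat more often than the long ones.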
ala Posted January 22, 2006 at 05:12 AM
The number is also different for different Chinese dialects. I wouldn't be surprised if Shanghainese averages 1.8-2.0, while Cantonese and Hakka are under 1.6. And I wonder if Mandarin words like 花儿 (huar) are being counted as one or two syllables? And is 社会主义 (socialism) one word or two? According to the Xinhua dictionary, it's two words (Shehui Zhuyi); but according to the Shanghainese tone sandhi pattern it's one word (Zooweitsugni). To me the phonology-based Shanghainese word partition is more solid than the arbitrary Xinhua partition. Hence different definitions of word boundaries will greatly alter the calculations in the above posts as well.
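How much the word-boundary choice matters can be seen in a tiny Python sketch. The phrase and both segmentations are invented for illustration, and `avg_chars_per_word` is a hypothetical helper, not part of any segmentation tool.

```python
def avg_chars_per_word(words):
    """Mean number of characters per word in an already segmented text."""
    return sum(len(w) for w in words) / len(words)


# The same made-up phrase, segmented two ways (illustrative only)
as_one_word = ["我", "学习", "社会主义"]        # 社会主义 kept as a single word
as_two_words = ["我", "学习", "社会", "主义"]   # 社会主义 split into 社会 + 主义

print(round(avg_chars_per_word(as_one_word), 2))   # 7 characters over 3 words
print(round(avg_chars_per_word(as_two_words), 2))  # 7 characters over 4 words
```

The character count is identical in both cases; only the number of words changes, and the average moves with it.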
davidpbrown Posted January 22, 2006 at 12:59 PM
Lugubert,
I'm only starting to learn Chinese and stumbled into talk of character frequency: how, knowing 2000 characters, you can recognise ~98% of written Chinese, and 3000 => 99.9%. Obviously recognising individual characters is not the same as understanding words and meaning.
Cumulative percentage frequencies for characters are listed here: http://lingua.mtsu.edu/chinese-computing/statistics/index.html
HTH
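A cumulative coverage figure of this kind is just a running total of character counts divided by the grand total. The Python sketch below shows the idea with made-up counts; they are not taken from the linked page, and `cumulative_coverage` is a hypothetical helper name.

```python
from itertools import accumulate


def cumulative_coverage(counts):
    """Percent of running text covered by the top-k characters, for k = 1..n.

    `counts` must already be sorted from most to least frequent.
    """
    total = sum(counts)
    return [100 * running / total for running in accumulate(counts)]


# Made-up character counts, most frequent first (hypothetical data)
chars = ["的", "是", "我", "在", "丁"]
counts = [400, 250, 200, 100, 50]

for rank, (char, pct) in enumerate(zip(chars, cumulative_coverage(counts)), start=1):
    print(f"top {rank} ({char}): {pct:.1f}% of running text")
```

This measures recognition of individual characters only; it says nothing about whether the words they form are understood.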
dean2000 Posted January 22, 2006 at 04:25 PM
I think it depends on how you calculate. For example:
(1) using statistics-based methods, the question is: how good is the corpus?
(2) using a dictionary, the question is: do you think an entry such as 丁 (ding1) is a pure character or a word?
ala Posted January 22, 2006 at 11:45 PM
Quote (davidpbrown): I'm only starting to learn Chinese and stumbled into talk of character frequency; how knowing 2000 characters you can recognise ~98% of written Chinese, 3000 => 99.9%. Obviously recognising individual characters is not the same as understanding words and meaning.
Yup, it's kind of like saying that by knowing just the 26 letters of the English alphabet, I can "recognize" 99.9% of all English text. No one can guess from the characters alone that 社会 means "society".
Quest Posted January 23, 2006 at 12:25 AM
Quote (davidpbrown): I'm only starting to learn Chinese and stumbled into talk of character frequency; how knowing 2000 characters you can recognise ~98% of written Chinese, 3000 => 99.9%. Obviously recognising individual characters is not the same as understanding words and meaning.
Quote (ala): Yup, it's kind of like saying that by knowing just the 26 letters of the English alphabet, I can "recognize" 99.9% of all English text. No one can guess from the characters alone that 社会 means "society".
That would be true for native speakers though. And by learning a radical here and there, picking up some characters from seeing them in the streets and in books, and asking their parents, native speakers don't really need to "learn" 3000 one by one either.
ala Posted January 23, 2006 at 05:17 AM
Yeah, native speakers are a completely different situation, because native speakers can "pronounce" out the characters and the sound registers as a word that has meaning. Even if they don't recognize 社会 when they see it, once they sound out the characters 社 and 会 they get she4 hui4, which registers as "society." But this is not possible for non-native beginners in Chinese.
Celso Pin Posted January 23, 2006 at 06:38 AM
Assuming you know 2000 or 3000 characters, you are not a beginner at all...