Chinese-Forums

Chinese word frequency count understanding


feng


I wanted to share a little about my experience of hitting the 1,000 character mark, and perhaps hear about the experiences of others.

Like many people, I have been aware of web sites like the one linked below, which suggest that knowing a certain number of characters correlates with a certain level of understanding. I wanted to share my own personal experience of just how wrong this is, but also to offer a word of encouragement for people starting out.

http://www.zein.se/patrick/3000en.html

  • Anki suggests I know 1,200 characters; realistically, I can probably read and recognize about 1,000.
  • Of the material I see every day (road signs, writing in shops, children's books, Chinese Facebook comments, and newspapers), I recognize anywhere between 50% and 100% of the characters. I probably understand the gist of maybe 50% of the sentences I come across. To suggest that knowledge of 1,000 characters maps to 89% understanding is ludicrous.
  • On the positive side, being able to recognize 50% to 100% of the characters in a sentence seems to be speeding up the rate at which I learn, i.e.:
    • When I come across a new word, much of the time there is no need to learn to recognize a new character.
    • Because I can read (even without understanding), I can read the sentence aloud to my wife, i.e. I can quickly get help in understanding.

I would be interested to hear about the experience of others when they hit the 1,000 character mark.


Congratulations on hitting the 1,000 character mark! Only 3-4 more thousand to go! :mrgreen:

Regarding these statistics, first of all, you need to realise the source of the frequency data. That site you linked to lists Jun Da as one reference, and if you go to that site you'll see a number of different frequency statistics depending on whether you are reading newspapers, literature, or a mix of both. Those statistics are quite different depending on the type of things you are reading, and are going to be even more different if you throw things like road signs, Facebook comments and the like into the mix.
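
To make the corpus point concrete, here is a minimal Python sketch. It is not how Jun Da's lists were built, and the two sample strings are made up for illustration and far too small to mean anything statistically. It treats the most frequent characters of one text as the "known" set, then measures what fraction of character occurrences that set covers in the same text and in a different kind of text.

```python
from collections import Counter


def hanzi(text: str) -> list[str]:
    """Keep only characters in the main CJK Unified Ideographs block."""
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]


def top_characters(text: str, n: int) -> set[str]:
    """Return the n most frequent Chinese characters in text."""
    return {ch for ch, _ in Counter(hanzi(text)).most_common(n)}


def coverage(text: str, known: set[str]) -> float:
    """Fraction of character occurrences in text that are in the known set."""
    chars = hanzi(text)
    if not chars:
        return 0.0
    return sum(ch in known for ch in chars) / len(chars)


# Hypothetical toy samples: a news-style sentence and a pair of signs.
news_sample = "政府宣布经济政策，政府表示经济增长"
sign_sample = "小心地滑，请勿吸烟"

known = top_characters(news_sample, 4)  # the four most frequent: 政, 府, 经, 济
print(coverage(news_sample, known))  # 9 of 16 occurrences are known
print(coverage(sign_sample, known))  # 0 of 8 occurrences are known
```

The same "known" set scores very differently on the two samples, which is the effect described above: a coverage figure only holds for text that resembles the corpus it was counted from.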

Secondly, you should keep in mind that these figures just present frequency information of characters in a given corpus, and don't necessarily correlate to understanding of the text. While these two things will be quite close for a native speaker, for learners of the language character frequency vs level of understanding will be very different. As has been mentioned multiple times over the years, words are just as important for understanding as characters, because while a native speaker who knew the character 即 and the character 使 would know what 即使 meant if they saw it written down (because they know 即使 from the spoken language), the same is not necessarily true for a learner.
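
The 即使 case can be sketched in a few lines of Python. This is only an illustration: the sentence is segmented into words by hand, and the "known" sets are hypothetical, describing a learner who recognizes every character but has never learned 即使 as a word.

```python
def char_coverage(words: list[str], known_chars: set[str]) -> float:
    """Fraction of character occurrences the reader can recognize."""
    chars = [ch for word in words for ch in word]
    return sum(ch in known_chars for ch in chars) / len(chars)


def word_coverage(words: list[str], known_words: set[str]) -> float:
    """Fraction of word occurrences the reader actually knows."""
    return sum(word in known_words for word in words) / len(words)


# Hand-segmented sentence: "Even if it rains, I will still go."
sentence = ["即使", "下雨", "我", "也", "去"]

known_chars = set("即使下雨我也去")      # every character is recognized
known_words = {"下雨", "我", "也", "去"}  # but 即使 was never learned as a word

print(char_coverage(sentence, known_chars))  # 1.0
print(word_coverage(sentence, known_words))  # 0.8
```

Character coverage is 100% here while word coverage is 80%, and the missing word is the one that carries the logic of the sentence.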

Finally, 89% sounds like a lot, but actually it's still a pretty poor rate for understanding a given piece of text (one in ten characters unknown). Other discussions have mentioned studies that show ~98% comprehension is around the level you need for comfortable understanding - although that would also be assuming 98% of words, not just 98% of characters.
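
A rough back-of-the-envelope calculation shows how quickly this adds up. The sketch below assumes, purely for simplicity, that unknown characters are spread independently through the text (real text is not like that) and that a "sentence" is 10 characters long; both are my assumptions, not figures from any study.

```python
def chance_all_known(coverage: float, length: int) -> float:
    """Probability that every character in a passage of the given length
    is known, assuming unknown characters occur independently."""
    return coverage ** length


for cov in (0.89, 0.98):
    # Chance of meeting no unknown character in a 10-character sentence.
    print(cov, round(chance_all_known(cov, 10), 2))
# prints 0.89 0.31, then 0.98 0.82
```

Under those assumptions, 89% character coverage leaves only about a one-in-three chance of getting through a 10-character sentence without hitting an unknown character, compared with roughly four in five at 98%.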

For reference, the (mainland) government specifies that 2,000 characters is the minimum required for basic literacy among urban residents - so you're halfway there (which corresponds nicely with your 50% figure :D). As mentioned before though, for learners it's different than for a native speaker. I've also noticed that before about 2,000 characters, learners tend to be quite highly focused on the number of characters they know. After that level is the point when the focus starts to turn more towards words.


Regarding the point that the frequencies are based on a specific "corpus" of text, I believe this is clear. However, I guess I wanted to make the point that these types of lists don't seem to match everyday real life in (in my case) Taiwan.

Your point about the focus switching from a character count to a word count matches my experience, i.e. often I can "read" all of the words in a sentence, but still not know what any of the words mean (:

The 2,000 character figure is not one I had heard of, but it is encouraging to hear; I guess that will be my next milestone. A reference to a 2,000 figure for "office workers" and a 1,500 figure for "farmers" does appear here:

http://www.accu.or.jp/litdbase/policy/chn/index.htm


Here are a couple of related discussions that also provide sources for that figure:

Reading fluency

Will China ever switch to pinyin as its writing system?

Edit: It's important to realise however that this figure is for *native* speakers only, not for learners. At 2,000 characters, you'll still have a while to go before you have comfortable comprehension.


Hitting 1000 was an accomplishment, but I plateaued about there, saw I wasn't moving as quickly as before, and kind of lost interest for a while -- I was trying to finish up a degree in not-Chinese. That was 10 years ago. I'm back at it and really just enjoying the process. I think I'm more patient now than I was in my 20s. : )

The problem I ran into is that they all started looking really similar! The nice thing is that as you read more, you are more comfortable with the vocabulary and you know what words to expect. So even when two characters that look the same come up in text, you know what words fit and you recognize those by sight and context.

Congratulations on the 1000 mark! Good luck on the next few thousand and remember if it starts going slow to just be patient and keep going!


Yeah, at this point the drop-off in reward for effort is quite staggering; the law of diminishing returns kicks in hard. Going by the figures linked to above, your first 1,000 characters get you approx. 89% coverage, your next 1,000 characters get you an extra 8% coverage, the next thousand get you an extra 2% coverage, and so on.
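
Spelled out as arithmetic, using only the three figures quoted in this post (the variable names are just for illustration):

```python
block_size = 1_000        # characters learned per block
gains = [89.0, 8.0, 2.0]  # percentage points of coverage gained per block

# Running total of coverage after each block of 1,000 characters.
cumulative = []
total = 0.0
for gain in gains:
    total += gain
    cumulative.append(total)

# Average coverage gained per character within each block.
per_character = [gain / block_size for gain in gains]

print(cumulative)           # [89.0, 97.0, 99.0]
print(per_character)        # [0.089, 0.008, 0.002]
print(gains[0] / gains[2])  # 44.5
```

On these figures, a character in the first thousand is worth on average 44.5 times as much coverage as one in the third thousand.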

This makes it really easy to feel like you're plateauing even though you're still making progress. The trick is to just make sure you keep at it, and keep at it, and keep at it.


Finally, 89% sounds like a lot, but actually it's still a pretty poor rate for understanding a given piece of text (one in ten characters unknown). Other discussions have mentioned studies that show ~98% comprehension is around the level you need for comfortable understanding

This is true. I think the concept can be understood with a simple example. Suppose somebody gives a lecture and says, "Today I'm going to talk with you about an important issue to all of us, namely x."

If you don't know the meaning of the 'x' word, but know all the others, it really is going to mess up your comprehension. Understanding 95% of the words does not at all mean that your comprehension of the idea is 95%.


Hello,

Like many people, I have been aware of web sites like the one linked below, which suggest that knowing a certain number of characters correlates with a certain level of understanding. I wanted to share my own personal experience of just how wrong this is, but also to offer a word of encouragement for people starting out.

4-5 months ago, I had studied about 1200 characters and was disappointed to see that I could not understand real material (magazines, stories, news...) beyond readers for learners. So I decided to make a conscious effort to teach myself about 1100 more characters over a relatively short period of time in the hope of taking a quantum leap and of reaching a 98.0% understanding level (according to Zein). So I focussed on doing just that, drilling characters, at the expense of all the rest. It's been hard - and it still is, because I'm still struggling with the backlog of my Anki "Characters" deck -, but on the whole it's been very useful. As I know more building blocks, I can guess the meaning of many words and learn more words much faster. I've even started reading real Chinese material with pleasure (currently reading 余华's 许三观卖血记) (caveat: with Pleco reader, or Perapera in my browser, etc., which makes it a bit too easy. Next goal: read a book on paper).

By the way, you don't study "just characters and no words" or, obviously, "just words and no characters". Even when focussing on the study of characters, I found that I can memorize them more easily if I try to see which words they can form (more and more frequently, I will know the other character of the same word) and if I complement drilling with daily reading. At that level (1000-2000 characters), you just can't read anything without coming across what you have just learned, which reinforces the memorization process.

So aiming for the 2000-2300 mark made a real difference in my case. But then, I still come across many texts, perhaps the majority of them, where my comprehension level drops to almost zero (see jon831's example), because I only recognize, say, 50-70% of characters.


Sometimes I think the 'learn words not characters' advice is overstated, but I think that's because it's often given to people who say they have learned lots of characters and are surprised they can't understand newspapers (for the reason imron wrote above). In fact, for me words and characters complement each other: learn lots of characters, and you can guess (and more easily learn) new words. Learn lots of words and you'll get a feel for what a certain character tends to mean. I think of them as two people climbing a mountain, roped on to each other, and helping each other up. But the rope isn't very long!


By the way, you don't study "just characters and no words" or, obviously, "just words and no characters"

I agree with this too. I'll always look at and get an understanding of the meaning of the characters involved in a word I don't know, but the main semantic unit I'll commit to memory is the word. Even for single-character words, rather than remembering just the single character, I try to find a multi-character word containing that character that most closely resembles the single character's meaning (this is possible for all but a tiny percentage of characters). This has the added benefit of reducing homonyms.

Edit: with reference to realmayo's comments, I guess when I say learn words not characters, perhaps a better way to put it would be learn characters from the context of words :D


I guess it also makes a huge difference WHICH characters you're learning.

I'm just about done with Heisig (which is far from a perfect book), and when I'm watching TV or reading comics, it is rare that I come across a character I don't know or whose meaning I can't guess.


Heisig started out as Japanese Kanji, but then branched out into Chinese Hanzi using the same approach.

He has one book for traditional characters and another for simplified characters. It is a good, but hugely flawed, book.

