Chinese-Forums

How accurate are frequency lists?


valikor


Hi Folks

Recently I've been studying some characters from the zein character frequency list (found here http://www.zein.se/patrick/3000char.html )

I have been surprised to find that some of the characters which are supposedly quite common actually seem to be somewhat rare/not useful. Has anyone else noticed this? Some examples:

All of these are among the top 2,000 most common characters, according to the list. But in some cases, after consulting a dictionary and asking Chinese people "What are some useful/common words in which you might use this character?", all I got was an obscure word or chengyu. I know some characters are mainly used in surnames...

Cross-referencing with another frequency list, I found similar results (i.e., if a character was number 1800 on one list, it might be 2200 on the other... but fairly consistent).

Any thoughts? Is the analyzed material skewing the results towards words not used in daily life? Or...?

Thanks

David


I would say all of those are relatively common, except perhaps for 翠, which I had to look up. Apparently it appears most frequently as 翠鸟 cuìniǎo kingfisher. The rest of these characters are either common in Chinese names or actually make common words by themselves, e.g. 茫茫 mángmáng boundless and indistinct, 乾[干] gān dry, etc... But then again, definitions of what is considered "common" differ.


翠 is also used in names. Off the top of my head, one of the maids in Ba Jin's "Family" had it in her name, as did the female lead in 潜伏, a popular recent TV series. 乾 does not mean dry in simplified texts (as tooironic writes, in that sense it is simplified to 干), but it shows up in names as qian2. Qianlong, one of the most famous Chinese emperors, used this character, which is enough to make it very common in written text (Qianlong is one of the few Qing emperors held in high esteem by the Chinese today; the other one is Kangxi). In fact, most of the characters you listed are common in names.

This is exactly the problem with frequency lists -- you get things out of context. You will also run into archaic characters that are only used in chengyu (but the chengyu are very common), and which can't be used alone, things only used for transcribing strange sounds, and things like that.

But I still find frequency lists useful. Just don't assume that something is really important just because it shows up in such a list. If a character seems odd and rare, then you probably don't need to learn it yet.

Actually, frequency lists are most useful as a tool to check from time to time how far you've got and whether you are missing some important characters. They are not something to base your studying on; the HSK vocab lists are much better for that.


It's not quite what you're asking, but currently I feel that learning based on character frequency lists is less than optimal after, say, 500-700 characters. First, for all the reasons mentioned above. Second, being able to read Chinese requires knowing words, not characters. While obviously there is some relationship between the two, they are not the same, and for actual studying I would recommend focusing on words, not characters.


There is a linguistic measure called dispersion, which is a scale from 0 to 1 of how evenly a word is spread across the texts in a corpus. It sounds like you are looking for a list that adjusts the frequencies for this, or at least flags these overrepresented characters. Depending on your learning goal, a character that shows up 1 time in each of 20 different texts may be more useful to know than one that shows up 20 times in one text and nowhere else.
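As a rough illustration of the 20-texts example, here is a minimal sketch using Juilland's D, one common dispersion formula. Which measure the post has in mind is not stated, so the choice of formula is an assumption, as are the function name `juilland_d` and the simplification that all texts are the same length.

```python
from math import sqrt
from statistics import mean, pstdev


def juilland_d(counts):
    """Juilland's D for one character.

    counts: occurrences of the character in each of n texts,
    assumed here to be of equal length.
    Returns a value near 1 for an even spread and near 0 when
    every occurrence sits in a single text.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two texts")
    mu = mean(counts)
    if mu == 0:
        return 0.0  # the character never occurs
    variation = pstdev(counts) / mu  # coefficient of variation
    return 1 - variation / sqrt(n - 1)


# Same total frequency (20), very different spread.
even = [1] * 20              # once in each of 20 texts
clumped = [20] + [0] * 19    # 20 times in one text, nowhere else

print(juilland_d(even))
print(juilland_d(clumped))
```

Both characters would get the same rank in a plain frequency list, but the first scores at the top of the dispersion scale and the second at the bottom (up to floating-point rounding).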

翠绿 (already mentioned above) is a word used frequently in 哈利波特与魔法石. It seems nothing is ever just plain green in that book.


Thanks everyone for your helpful replies. To add one more note:

Within one week of learning all of these characters (which I initially thought of as obscure and non-useful, since I couldn't find many good words in my dictionary that include them), I have encountered all of them except 桑.

I watched a movie where 凯文 was "Kevin", saw 赫 at least once (I forget where), saw 鹏 in the name of some kind of soap in a TV commercial, saw 茫茫 used, and saw 乾坤 used in some bad Chinese TV show.

As some of you said--maybe they're not all that uncommon after all!


I find that happens a lot in language. It's possibly a kind of Baader-Meinhof Phenomenon, or an inflated sense of importance to coincidences. But I think it's more likely perceptual vigilance or selective attention. I can ignore an unknown word for years until I finally have the spare brain cells to learn it. Then, once I learn it, I am suddenly aware of seeing it other places, especially since it was recently learned and thus fresh in my memory (the recency effect). I didn't learn the word 生活 for 2 1/2 years. How could I not have noticed it? Obviously, once I learned it, I saw it everywhere.


  • 1 year later...

I'm not the first to point out that character frequency lists are unreliable. I've come across 3 or 4 that omit many commonly used characters. If the sources such lists are derived from are not carefully chosen to reflect variation, they won't be representative. An example is Jun Da's 9,933 character list. Dozens of the characters I've studied are not included in his rather exhaustive list.


I feel like I never see 茫 other than in my Anki deck. I do wonder if it gets counted twice in a frequency list when it shows up in the word 茫茫. Perhaps that's why it's rated as being more common than, in my experience, it really is?

Also, 赫 is used for Hertz, so it might be one of those characters that show up many times in a small number of texts, giving an inflated impression of their importance.
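To make the double-counting question concrete, here is a small sketch of a raw character count, assuming the list simply tallies every occurrence of every character. That assumption, the helper name `char_counts`, and the sample sentence are all made up for illustration and do not describe how any particular list was built.

```python
from collections import Counter


def char_counts(text):
    """Count every occurrence of each character in the basic CJK block."""
    return Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")


# Hypothetical sample sentence containing the word 茫茫 twice.
sample = "大海茫茫，前途茫茫。"

counts = char_counts(sample)
word_hits = sample.count("茫茫")  # non-overlapping occurrences of the word

print(counts["茫"])  # character tally
print(word_hits)     # word tally
```

Under this kind of counting, each 茫茫 adds two to the tally for 茫, so the character's count is twice the number of times the word appears.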


Both of these show up quite regularly in novels. 赫 appears not for Hertz, but in words like 赫然.

"Dozens of the characters I've studied are not included in his rather exhaustive list."

Dozens out of almost 10,000 is not a bad ratio. It's not meant to be an exhaustive list though, and will only be as accurate/good as the corpus it was based on.


  • 4 months later...

The question 'how accurate is ...' only makes sense if there is something to compare it to that you know is accurate. As the total corpus of written (let alone spoken) Chinese is immeasurably vast and expanding all the time, all we can do is look at slivers of it taken at certain times. The bigger the sliver, the more up-to-date it is, and the more closely the selection of text matches what you are likely to encounter, the more accurate the resulting list will seem. No two lists will be the same, and the only way you can criticise the accuracy of a list is to compile your own, from a bigger, more recent, and more widely-ranging corpus. Then someone else will come along and claim that yours is not accurate.

