Chinese-forums.com
Learn Chinese in China


How accurate are frequency lists?


valikor

Hi Folks

Recently I've been studying some characters from the zein character frequency list (found here: http://www.zein.se/patrick/3000char.html).

I have been surprised to find that some of the characters which are supposedly quite common actually seem to be somewhat rare/not useful. Has anyone else noticed this? Some examples:

All of these are among the top 2,000 most common characters, according to the list. But in some cases, after consulting a dictionary and asking Chinese people "What are some useful/common words in which you might use this character?", all I got was an obscure word or chengyu. I know some characters are mainly used in surnames...

Cross-referencing with another frequency list, I found similar results (i.e., if a character was number 1800 on one list, it might be 2200 on the other... but fairly consistent).

Any thoughts? Is the analyzed material skewing the results towards words not used in daily life? Or...?

Thanks

David


tooironic

I would say all of those are relatively common, except perhaps for 翠, which I had to look up. Apparently it appears most frequently as 翠鸟 cuìniǎo kingfisher. The rest of these characters are either common in Chinese names or actually make common words by themselves, e.g. 茫茫 mángmáng boundless and indistinct, 乾[干] gān dry, etc... But then again, definitions of what is considered "common" differ.

renzhe

翠 is also used in names. Off the top of my head, one of the maids in Ba Jin's "Family" had it in her name, as did the female lead in 潜伏, a popular recent TV series. 乾 does not mean dry in simplified texts (as tooironic writes, it is simplified to 干), but it shows up in names as qian2. Qianlong, one of the most famous Chinese emperors, used this character, which is enough to make it very common in written text (Qianlong is one of the few Qing emperors held in high esteem by the Chinese today; the other is Kangxi). In fact, most of the characters you listed are common in names.

This is exactly the problem with frequency lists -- you get things out of context. You will also run into archaic characters that are only used in chengyu (but the chengyu are very common), and which can't be used alone, things only used for transcribing strange sounds, and things like that.

But I still find frequency lists useful. Just don't assume that something is really important just because it shows up in such a list. If a character seems odd and rare, then you probably don't need to learn it yet.

Actually, frequency lists are most useful as a tool to check from time to time how far you've got and whether you are missing some important characters. They are not something to base your studying on; the HSK vocab lists are much better for that.

aristotle1990

翠 also appears in common chengyu like 翠色欲流 and 翡翠鲜笋煲. And don't forget 翠绿, which is an HSK word.

gegehuhu

翠湖, or Green Lake, is a famous lake and park in the middle of Kunming.

jbradfor

It's not quite what you're asking, but currently I feel that learning based on character frequency lists is less than optimal after, say, 500-700 characters. First, for all the reasons mentioned above. Second, being able to read Chinese requires knowing words, not characters. While obviously there is some relationship between the two, they are not the same, and for actual studying I would recommend focusing on words, not characters.

万里长城

桑 also appears in common words like 桑树 and 桑葚. And don't forget 桑麻, which is an HSK word.

c_redman

There is a linguistic measure called dispersion, which is a scale from 0 to 1 of how evenly a word is spread across the texts in a corpus. It sounds like you are looking for a list that modifies the frequencies or at least notes these overrepresented characters. Depending on your learning goal, a character that shows up 1 time in each of 20 different texts may be more useful to know than one that shows up 20 times in one text and nowhere else.

翠绿 (already mentioned above) is a word used frequently in 哈利波特与魔法石. It seems nothing is ever just plain green in that book.
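
For anyone who wants to try the dispersion idea, here is a minimal sketch of one such measure, Juilland's D, assuming the corpus has been split into equal-sized texts. The function name and the toy counts are made up for illustration; they mirror the "1 time in each of 20 texts" versus "20 times in one text" example above, and this is not necessarily the measure any published list uses.

```python
from math import sqrt
from statistics import mean, pstdev


def juilland_d(counts):
    """Juilland's D for one character, given its count in each text.

    Assumes all texts are roughly the same size. Returns a value
    near 1 when the character is evenly spread and near 0 when all
    occurrences are clumped in a single text.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need counts from at least two texts")
    mu = mean(counts)
    if mu == 0:
        return 0.0  # character never occurs, so no dispersion to speak of
    v = pstdev(counts) / mu  # coefficient of variation across texts
    return 1 - v / sqrt(n - 1)


# Hypothetical counts: same total frequency (20), very different spread.
even = [1] * 20            # once in each of 20 texts
clumped = [20] + [0] * 19  # 20 times in one text, nowhere else

print(juilland_d(even))     # evenly spread, D is 1.0
print(juilland_d(clumped))  # fully clumped, D is close to 0
```

Both characters would get the same rank on a raw frequency list, which is the point: the frequency alone cannot tell them apart.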

valikor

Thanks everyone for your helpful replies. To add one more note:

Within one week of learning all of these characters (which I initially thought of as obscure and non-useful, since I couldn't find many good words in my dictionary that include them), I have encountered all of them except 桑.

I watched a movie where 凯文 was "Kevin", saw 赫 at least once (I forget where), saw 鹏 in the name of some kind of soap in a TV commercial, saw 茫茫 used, and saw 乾坤 used in some bad Chinese TV show.

As some of you said--maybe they're not all that uncommon after all!

c_redman

I find that happens a lot in language. It's possibly a kind of Baader-Meinhof Phenomenon, or an inflated sense of importance to coincidences. But I think it's more likely perceptual vigilance or selective attention. I can ignore an unknown word for years until I finally have the spare brain cells to learn it. Then, once I learn it, I am suddenly aware of seeing it other places, especially since it was recently learned and thus fresh in my memory (the recency effect). I didn't learn the word 生活 for 2 1/2 years. How could I not have noticed it? Obviously, once I learned it, I saw it everywhere.

LongShiKong

I'm not the first to point out that character frequency lists are unreliable. I've come across 3 or 4 that omit many commonly used characters. If the sources such lists are derived from are not carefully chosen to reflect variation, they won't be representative. An example is Jun Da's 9,933 character list. Dozens of the characters I've studied are not included in his rather exhaustive list.

WestTexas

I feel like I never see 茫 other than in my Anki deck. I do wonder if it gets counted twice in a frequency list when it shows up in the word 茫茫. Perhaps that's why it's rated as being more common than, in my experience, it really is?

Also, 赫 is used for Hertz. That might be one of those words which shows up many times in a small number of texts, giving an inflated impression of its importance.
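
On the double-counting question, a plain per-occurrence count does tally a doubled character twice. The sketch below shows that behaviour on a made-up sentence; whether the zein list was built exactly this way is an assumption, and the helper name `char_counts` is hypothetical.

```python
from collections import Counter


def char_counts(text):
    """Count every occurrence of each character in the basic CJK block."""
    return Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")


# Made-up sample sentence containing both 茫茫 and 迷茫.
sample = "前路茫茫，他感到很迷茫。"
counts = char_counts(sample)

print(counts["茫"])  # 3: two from 茫茫 plus one from 迷茫
```

So a reduplicated word like 茫茫 would push its character up a list faster than a word where it appears only once.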

heifeng

I'm guessing you guys don't often feel 迷茫 or 茫然不解 if you don't think 茫 is very commonly used hehe j/k.

imron

Both of these show up quite regularly in novels. 赫 not for Hertz, but for example in things like 赫然.

"Dozens of the characters I've studied are not included in his rather exhaustive list."

Dozens out of almost 10,000 is not a bad ratio. It's not meant to be an exhaustive list though, and will only be as accurate/good as the corpus it was based on.

Raphanid

翠华 is a popular chain restaurant in Hong Kong. 桑树 are important in silk production, which has cultural resonance in China. 凯里 is a fairly large city in 贵州.

li3wei1

The question 'how accurate is ...' only makes sense if there is something to compare it to that you know is accurate. As the total corpus of written (let alone spoken) Chinese is immeasurably vast and expanding all the time, all we can do is look at slivers of it taken at certain times. The bigger the sliver, the more up-to-date it is, and the more closely the selection of text matches what you are likely to encounter, the more accurate the resulting list will seem. No two lists will be the same, and the only way you can criticise the accuracy of a list is to compile your own, from a bigger, more recent, and more widely-ranging corpus. Then someone else will come along and claim that yours is not accurate.
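
To make the "compile your own" point concrete, here is a rough sketch that builds a ranked character list from each of two tiny corpora and compares where a character lands. The corpora, the names `rank_table`, `formal` and `casual`, and the sentences themselves are all invented for illustration; a real list would need vastly more text.

```python
from collections import Counter


def is_hanzi(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def rank_table(texts):
    """Map each character to its frequency rank (1 = most common)."""
    counts = Counter(ch for t in texts for ch in t if is_hanzi(ch))
    ranked = counts.most_common()  # sorted by count, highest first
    return {ch: i for i, (ch, _) in enumerate(ranked, start=1)}


# Two made-up "slivers": one history-flavoured, one everyday chat.
formal = ["乾隆皇帝下江南", "乾隆年间的故事"]
casual = ["今天天气很干", "我今天很忙"]

formal_rank = rank_table(formal)
casual_rank = rank_table(casual)

for ch in "乾天":
    # .get returns None when the character never occurs in that corpus
    print(ch, formal_rank.get(ch), casual_rank.get(ch))
```

The same character can sit near the top of one table and be missing from the other entirely, depending only on which sliver of text was counted.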
