Chinese-Forums

Most Frequently used characters vs combination


ilprincipe


Dear fellow chinese students,

There are a lot of discussions about which are the top 500, 1000, or 2000 most used characters. However, it is important to note that simply knowing those top characters will not get us anywhere near being able to read newspapers or understand the language. The really useful thing is to know the meaning of compounds. I believe we all agree on this.

I ran some simple statistics and I see that the top 500 characters combined with one another can yield 7,000 meanings, the top 1000 about 17,000, the top 1500 about 25,000, and the top 2200 about 35,000. Many of them are rarely used, of course, so....

So the question for us is: instead of focusing only on the so-called most frequently used characters, how can we find out the most frequently used 'compounds', so that we can tailor our character learning to those compounds and actually be able to read newspapers etc., and not stumble on 明 and 白 without understanding that together they mean 'to understand' instead of 'clear' and 'white'?

Any comments/feedback are most appreciated to make progress in this field.

regards to all..

Link to comment
Share on other sites

Try the HSK vocab list, which contains the words you are expected to know for the various levels of the HSK. There are 8000 words in the four levels combined, I think.

http://hskflashcards.com/download.php

Some comments on the HSK vocab list:

http://laowaichinese.net/hsk-vocabulary-levels-added-to-mdbg.htm

Just because a word has a lower rating, doesn’t mean it’s more commonly used. Here are a few ways the HSK info isn’t useful (to those of us not preparing to take the test):

1. HSK rating has nothing to do with spoken/written or formal/informal frequencies. For example, in my experience, computer is spoken much more frequently as “diànnǎo” 电脑 but is formally referred to (like if your major is computers in college) as “jìsuànjī” 计算机. Both of these words appear on HSK list 3.

2. The difference between 1 and 2 is negligible. Vocabulary lists 1 and 2 are both covered by the Basic (lowest) test, so a word may appear on list 2 simply because they ran out of room on list 1. For example, “yǎnjing” 眼睛 gets an HSK rating of 1, but yǎn 眼 by itself is 2. Surely you’d know the single character before learning the two of them together.

The bottom line is: if a word has a HSK rating in the dictionary, it’s more likely to be a common word than one without a rating. Also, if I’ve got to choose between two synonyms (that really can be used interchangeably) I’m going to choose the one with the lower HSK number.

Link to comment
Share on other sites

Careful. Those are not words, they are bigrams. And as it says on that page:
Note: A bigram may be a nonsense combination of characters.
The bigram data only tells you the frequency of two characters appearing next to each other, not whether or not they are actually words (e.g. it could be the last character of one word followed by the first character of a different word).
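
To make the difference concrete, here is a minimal Python sketch of how bigram counts of this kind could be produced. It is an illustration only, not the method used by the page in question; the helper names `is_cjk` and `char_bigrams` are made up for this example, and the sample phrase is borrowed from the Google China article quoted elsewhere in this thread.

```python
from collections import Counter

def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def char_bigrams(text):
    """Count every pair of adjacent Chinese characters, ignoring word boundaries."""
    pairs = zip(text, text[1:])
    # Skip pairs that involve punctuation, digits or Latin letters
    return Counter(a + b for a, b in pairs if is_cjk(a) and is_cjk(b))

bigrams = char_bigrams("全球副总裁")
for pair, count in bigrams.items():
    print(pair, count)
```

The phrase is made of the words 全球 and 副总裁, but the counter also records 球副, which straddles the boundary between them and is not a word. A raw bigram list has no way of telling these cases apart.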
Link to comment
Share on other sites

I think learning from wordlists is unnecessary. Of course if you enjoy doing so, then fine - I'm not trying to talk anyone out of it, but I've never seen the need for wordlists myself, mainly for the following reason:

Different people have different interests and activities. A wordlist just contains general vocabulary that could be applicable to everyone, but not specific to anyone's personal habits. For example, if you want to be able to read a newspaper, sure some of the wordlist vocabulary may be useful, but there's probably going to be a lot of vocabulary from outside the wordlist. If you are interested in technology, and tend to read more of this kind of article, then you will need to learn more technology-related vocabulary. Likewise, if you are interested in sports, then you will need to learn more sports-related vocabulary. The best way to do this is just try to read articles you are interested in, and learn new vocabulary as and when you meet it.

Needless to say, if you read enough, the so-called "frequently used words" will appear frequently, so you will end up learning them first anyway.

Link to comment
Share on other sites

thanks for your replies, interesting stuff.

I agree of course that learning word lists is a topic-specific issue and I am not suggesting anyone should learn Chinese that way. But maybe the same can be said for the character/flashcard approach... both are lists, and you may or may not encounter their contents in your daily situation.

my question was aimed at trying to maximise the use from having learnt the top x-characters. For example:

Given that I know the top n (be it 500, 1000, or whatever) characters, what are the most used (IN ORDER OF FREQUENCY) words/compounds that use only those n characters? And I know we go back to the issue of topic (sport, news, etc..)..but ...
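
As a rough sketch of that question in code: given a word frequency table and a set of known characters, keep only the multi-character words written entirely with known characters and sort them by frequency. The counts below are invented purely for illustration, and `words_from_known_chars` is a hypothetical helper; a real run would need an actual word frequency list.

```python
def words_from_known_chars(word_freqs, known_chars):
    """Return (word, frequency) pairs for compounds built only from known characters."""
    known = set(known_chars)
    usable = [
        (word, freq)
        for word, freq in word_freqs.items()
        # Keep compounds only, and only those whose every character is known
        if len(word) > 1 and set(word) <= known
    ]
    # Most frequent first
    return sorted(usable, key=lambda pair: pair[1], reverse=True)

# Made-up counts, for illustration only
word_freqs = {"明白": 120, "明天": 300, "白天": 80, "离开": 200, "明": 50}
known_chars = "明白天"

ranked = words_from_known_chars(word_freqs, known_chars)
for word, freq in ranked:
    print(word, freq)
```

With these made-up numbers the result is 明天, 明白, 白天 in that order. 离开 is dropped because its characters are not in the known set, and 明 is dropped because it is a single character rather than a compound.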

Link to comment
Share on other sites

The link in post #5 does almost what you want. It shows the top 3000 words (including compound words), ordered by frequency (I'm not sure how that correlates to character frequency, but you can be sure that all the characters used also have a relatively high frequency).

As for your other point, flashcards can easily be created using the new vocab you pick up from reading. Likewise, you can create your own revision word lists from those words. The basic principles behind character drilling and flashcarding are still applicable as the only thing that is different is the source of new words/characters.

The thing is, there is generally very little word frequency information publicly available, compared to say character frequency information.

This is why getting new vocabulary from articles works so well, because by nature of the process, these will be the most frequently appearing words in material that is of interest to you. After all, articles are essentially just word lists. Take for example the lead paragraph in a recent news article about the head of Google China resigning.

谷歌全球副总裁、大中华区总裁李开复将于今日正式辞职,在四年任期结束后最终选择离开。据可靠消息称李开复今后可能自主创业。至此,自2005年谷歌正式入华以来组建的创始团队已经悉数离开。

You could essentially just treat this as the word list:

谷歌

全球

副总裁

大中华区

总裁

李开复

今日

正式

辞职

任期

结束

最终

选择

离开

可靠

消息

李开复

今后

可能

自主

创业

至此

2005

谷歌

正式

入华

以来

组建

创始

团队

已经

悉数

离开

Which contains a whole bunch of highly relevant (and frequently occurring) words - assuming your interest is in say the Chinese business/technology sector - and is much more readily available than word frequency lists.
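
Once a passage has been split into words like this, counting is straightforward. A minimal sketch using Python's standard `collections.Counter` on the list above (the split itself was done by hand here, not by the code):

```python
from collections import Counter

# The hand-segmented word list from the lead paragraph above
words = [
    "谷歌", "全球", "副总裁", "大中华区", "总裁", "李开复", "今日", "正式",
    "辞职", "任期", "结束", "最终", "选择", "离开", "可靠", "消息",
    "李开复", "今后", "可能", "自主", "创业", "至此", "2005", "谷歌",
    "正式", "入华", "以来", "组建", "创始", "团队", "已经", "悉数", "离开",
]

counts = Counter(words)

# Words that already repeat within a single paragraph
repeated = [word for word, n in counts.most_common() if n > 1]
print(repeated)
```

Even in one paragraph, 谷歌, 李开复, 正式 and 离开 each appear twice. Run over many articles on the same topic, the same count gives a frequency list tailored to that material.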

Link to comment
Share on other sites

Different people have different interests and activities. A wordlist just contains general vocabulary that could be applicable to everyone, but not specific to anyone's personal habits.

I agree, but I think that learning vocabulary lists like that is mainly useful in the early stages, to get a basic vocabulary that appears everywhere, all the time. Memorizing the 5000 core words or so can make many written materials accessible to you. And this, in turn, can give you a lot of context, a lot of new words, etc.

Above that basic level, of course you need to learn through exposure.

Link to comment
Share on other sites

I think that learning vocabulary lists like that is mainly useful in the early stages, to get a basic vocabulary that appears everywhere, all the time. Memorizing the 5000 core words or so can make many written materials accessible to you.

As I mentioned previously, I'm not trying to dissuade anyone from memorising vocabulary lists if that's what they wish to do, but personally, if I had a list of 5000 words to memorise during the early stages of learning a language, I think my enthusiasm for it would be killed off pretty quickly.

Link to comment
Share on other sites

points taken

you cannot just memorise characters in sequence, but maybe it can be helpful to memorise compounds formed by characters you already know..so no new character learning, just new combinations..

I also think that most combinations have a logic, so rather than memorising them, one needs to take a quick look, see the reason behind each one, and it will then be hard to forget.

the links you all posted are very helpful, I was not aware of them, so thanks very much to all!

Link to comment
Share on other sites

  • 3 months later...

If I found some resources I'm really interested in and would like to start understanding them, how do I know what's a word and what's not?

I mean .. in the post #10 http://www.chinese-forums.com/showpost.php?p=200092&postcount=10

全球副总裁大中华区

how do I know the boundaries?

全球

副总裁

大中华区

My Lingoes doesn't recognize the words all the time.

My current status:

knowing around 200 characters (150 of the most used) and some words.

Speaking is way better than reading.

Thank you.

Bye, Chris

Link to comment
Share on other sites

Practice :-) Or use software that is capable of splitting a sentence into words instead of characters, e.g. Wenlin, Adsotrans etc (Edit: or my own Chinese Text Analyser).

It's worth noting however that software like this has its limitations and nothing is a substitute for having a good feel for the language and knowing where a sentence should be broken up. This feel usually comes after lots of reading/listening to native level materials.
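
For a sense of how automatic splitting can work at its simplest, here is a sketch of greedy longest-match segmentation against a dictionary. This is a textbook baseline and an assumption for illustration; it is not a description of how Wenlin, Adsotrans or Chinese Text Analyser work internally. The tiny dictionary and the `segment` function are hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Split text by repeatedly taking the longest dictionary word at the current position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

# Toy dictionary, for illustration only
dictionary = {"全球", "副总裁", "总裁", "大中华区", "中华"}

result = segment("全球副总裁大中华区", dictionary)
print(result)
```

With this toy dictionary the phrase comes out as 全球 / 副总裁 / 大中华区. The approach is only as good as its dictionary: a missing entry such as a name or a new term falls back to single characters, and the longest match is not always the right one, which is one reason such tools have the limitations mentioned above.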

Link to comment
Share on other sites

how do I know the boundaries?

That's where the 'learn' bit of learning Chinese comes in. There may well be a process of trial and error as you try and figure out if that is 全 球副总裁 大中 华区; or 全球 副总 裁大 中华区; or whatever, but you'll get there in the end. But at 200 characters, you're probably better off with material designed for learners.

Link to comment
Share on other sites

  • 5 years later...

The thing is, there is generally very little word frequency information publicly available, compared to say character frequency information.

The question of word frequency lists popped into my head this morning. A quick search on Google brought this thread up practically at the top. I can imagine a list of the most common 2, 3 or 4 character words being quite useful.
Link to comment
Share on other sites

I think my point still stands though, which is that as you increase in level, frequency lists become less and less useful because the material they are gathered from may be vastly different from what you might be reading, and so what is marked as a 'frequent' word may actually be quite infrequent and vice versa.

 

That was part of the motivation I had for creating Chinese Text Analyser, to make it easier to find frequency information of content you are reading.

Link to comment
Share on other sites

I think my point still stands though, which is that as you increase in level, frequency lists become less and less useful because the material they are gathered from may be vastly different from what you might be reading, and so what is marked as a 'frequent' word may actually be quite infrequent and vice versa.

That was part of the motivation I had for creating Chinese Text Analyser, to make it easier to find frequency information of content you are reading.

I would agree with that.
Link to comment
Share on other sites

@imron I agree in general, but I do find having frequency lists based on popular media to be very useful.... for consuming popular media.

 

It's just I don't have a very complete collection of texts of popular Chinese media... otherwise I could make a great big text file and CTA it.

Link to comment
Share on other sites
