Chinese-Forums

Frequency Vocab List?


js6426


I have found many frequency lists for the most common Chinese characters, but is there any such list for the most common words (including multi-character words rather than just single characters)?  I have had a look around but haven't been able to find anything yet.

 

Thanks


There's not a lot out there.  One I can think of off the top of my head is this one.  The problem with these types of lists, however, is that they may or may not be relevant to your vocabulary and to the type of content you want to use.  The more advanced your level is, the more likely this is to be true.

 

Take the above list, for example: it's generated from film subtitles, so it will be relatively good for words found in dialogue and spoken text, but relatively poor for words found in newspaper articles and novels.

 

If you'll excuse the shameless plug, I wrote a tool that lets you generate your own wordlists based on frequency and/or several other metrics, from any piece of Chinese text.  It will also keep track of your known vocabulary over time, so you can use it to export the top 10 unknown words from a given article, and so on.
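
To make the idea concrete, here is a minimal sketch of that kind of workflow. It is not the tool mentioned above, just an illustration under stated assumptions: the names `segment`, `is_cjk` and `top_unknown` are made up for this example, the tiny lexicon is hypothetical, and the segmenter is a naive greedy longest-match (a real tool would use a proper word segmenter and a full dictionary).

```python
from collections import Counter


def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching (an assumed, simplistic segmenter).

    At each position, take the longest substring found in `lexicon`;
    if nothing matches, fall back to a single character.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def is_cjk(word):
    """True if every character is in the basic CJK Unified Ideographs block."""
    return all('\u4e00' <= ch <= '\u9fff' for ch in word)


def top_unknown(text, lexicon, known, n=10):
    """Return the n most frequent words in `text` that are not in `known`."""
    counts = Counter(
        w for w in segment(text, lexicon)
        if is_cjk(w) and w not in known  # skip punctuation and known words
    )
    return counts.most_common(n)


# Hypothetical data, purely for illustration.
lexicon = {"电影", "字幕", "喜欢", "中文"}
known = {"我", "喜欢", "看"}
text = "我喜欢看电影，电影有中文字幕。"

for word, freq in top_unknown(text, lexicon, known, n=10):
    print(word, freq)
```

The known-word set is what makes the list personal: the same article gives a different "top 10 unknown words" for each reader, and the set can simply grow as you study.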


  • 3 weeks later...

After looking at the list I linked, it's obvious that it was generated from newspapers, and has more formal words that I don't need to worry about yet. For example, 領域, meaning scope or field of operation.

Instead, consider the SUBTLEX-CH list, generated from movie and TV subtitles. http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexch


SUBTLEX-CH has its own issues, mainly that the corpus it is based on contains a lot of translated subtitles for American movies and shows. As a result, you'll notice that a ton of transliterations of English proper nouns show up. And who knows what other biases are introduced by the fact that so much of it is translated material? For example, some common chengyu may be ranked really low because they wouldn't be a natural way to translate anything from English. Overall, I see this as a serious problem with this particular frequency list.

If your goal is just to have a list of words to study then I think the official HSK vocab lists work very well for that purpose.


Well, some content will be more general and well-rounded than others. To me, a corpus that heavily features translated texts is particularly problematic from a linguistic standpoint, in a way that goes beyond the problem that every corpus will have of not being perfectly tailored to your own interests.

As for the HSK list, my opinion is that up through level 5 the words are frequent enough that it's worth it to learn the entire list straight through. At HSK 6 it starts getting murky and it becomes worth it to mine your own vocab from content that you are reading/watching.


"in a way that goes beyond the problem that every corpus will have of not being perfectly tailored to your own interests."

Which is why I advocate creating a corpus perfectly tailored to your own interests! :mrgreen: (or, if not your own interests, then at least what you are currently reading).

 

"At HSK 6 it starts getting murky"

I think it gets murkier before that.  Yes, the words on the earlier lists have a high frequency in general texts, but there are also plenty of other frequent words that are not on these lists, and which ones they are will change quite significantly depending on what you are reading.

 

If you have a choice between learning words that might be relevant in a few months' time, or words that will be directly relevant that day or in the coming days, then for me it's really a simple choice.  You'll end up learning all the HSK vocab eventually, just on a more random schedule.
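
One rough way to check this for yourself is to measure how much of a text a given word list actually covers. The sketch below is only an illustration: `coverage` is a made-up helper, the tokens are assumed to be already segmented into words, and the word list is a hypothetical stand-in rather than a real HSK list.

```python
from collections import Counter


def coverage(tokens, word_list, n_missing=5):
    """Return (fraction of running words found in word_list,
    the most frequent words that are missing from it)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = sum(c for w, c in counts.items() if w in word_list)
    missing = Counter({w: c for w, c in counts.items() if w not in word_list})
    ratio = covered / total if total else 0.0
    return ratio, missing.most_common(n_missing)


# Hypothetical, pre-segmented sample and a stand-in word list.
tokens = ["我", "喜欢", "看", "电影", "电影", "字幕"]
word_list = {"我", "喜欢", "看", "电影"}

ratio, missing = coverage(tokens, word_list)
print(f"coverage: {ratio:.0%}")
print("most frequent words not on the list:", missing)
```

Running the same check on a few different texts shows how much the missing words shift with the material, which is the point being argued here.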


  • 3 weeks later...

 

 

https://en.wiktionar...Frequency_lists

Thanks, iand, I've been looking for something like this for a while.

Noticed something interesting in the first list: 台灣 台湾 is no. 80? 

 

Academia Sinica (not "Academica", as it is spelled in the Wikipedia article) is a Taiwanese academic institution, so their corpus probably reflects Mandarin as it is spoken and written in Taiwan.

