Chinese-forums.com

Is there an authoritative resource that measures how common terms 词 are in the Chinese language?

gedawei

I came across the term 费解 (unintelligible, puzzled) recently. It’s not in HSK. Seems like it could be a fairly common word, so that led me to wonder: is there an authoritative source yet for finding out just how common or uncommon a Chinese term is (i.e., separately from how common a character is)? Seems like there’d be a market for an easily lookup-able database online of Chinese terms which provides some kind of ranking in terms of frequency of use. 


Publius

Google Ngram is hardly an authoritative source, but it does give you a number, for example, 0.000080% for 费解 (in comparison, 费用 is 0.0140%, 175 times more common).
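As a quick check of the arithmetic, the ratio can be computed directly from the two percentages quoted above. This is only a minimal sketch: the variable names are made up for illustration, and the figures are the ones from this post rather than fresh Ngram lookups.

```python
from fractions import Fraction

# Percentages quoted in the post above (exact decimal strings, no float rounding)
pct_feijie = Fraction("0.000080")   # 费解
pct_feiyong = Fraction("0.0140")    # 费用

# How many times more common 费用 is than 费解
ratio = pct_feiyong / pct_feijie
print(f"费用 is about {float(ratio):.0f} times more common than 费解")
```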

 

gedawei

Thank you!!

 

“Google Ngram is hardly an authoritative source, but it does give you a number, for example, 0.000080% for 费解 (in comparison, 费用 is 0.0140%, 175 times more common).”

 

davidchenx

Hi, my name is David. I'm new, and this is my first post.

 

There is one. I use Pleco, and it seems to sort 词 by usage frequency. For example, if you look up the word 人, it lists the 词 containing 人 in that order. I'm not 100% sure, but it does look pretty much like that.

 

pleco screenshot

 

Share on other sites
DavyJonesLocker
55 minutes ago, imron said:

The problem is, that once you're at HSK 4 or above, the relative frequencies of each word will be significantly different depending on what you are reading.

 

That means there is no authoritative source, and you're better off calculating the frequencies of words from content that you are reading.  I made a tool that will help you do just that.

 

I agree with that sentiment. It's really not ideal to go through frequency tables on the chance you might come across a word. I'd probably go a little further and say that following the HSK lists up to HSK 5 (maybe) is at least a fairly OK path to take, but then it's high time to veer off. Above HSK 5 you really need a good reason (like sitting the exam) to continue along the HSK word list. In any case, bashing through a pile of flashcards before you read something is a pretty difficult task. It is useful for a chapter of a book, a text lesson, etc., when the list is limited to 50 or so words and you are certain you will read the text in the next few days.

 

Looking at my deck now, it has 9618 words and only 3895 are from HSK 1-6. This spreadsheet came from all sorts of sources: WeChat, text messages, shopping apps, graded readers, textbooks, the odd movie, slang words people told me, etc. However, I probably only know 2/3 of this list.

roddy

It all depends on the corpus. Some are more useful than others. 

markhavemann

Some time ago I found a set of word lists put together by 北京语言大学 showing word frequency in different areas. There is a list compiled from newspapers, one from literature, one from Weibo, and one or two from other sources, plus a "global" one which I assume is all of them together.

 

I have uploaded it to a folder in the Google Drive that I use for the Transcription Project. You can access it here: https://drive.google.com/drive/folders/1w2BPsbMmuruTONmr4xy6s1CwG7IJ5CTd?usp=sharing

 

There is also Subtlex, which uses only data from movie subtitles.

 

No doubt the words are segmented automatically using NLP, which probably gives pretty good results, but probably not 100% accurate ones.

 

 

I have in fact made a dictionary type thing which I use myself for looking up words and seeing their frequency.

 

Currently it only displays a percentage, which is the percentage of films that the word appears in, based on the Subtlex data that I linked to above, or no figure at all if the word is not part of the Subtlex dataset. I do at some stage hope to increase its scope and use a larger dataset with different categories. You are welcome to use it if you like; it's online here: http://www.sino-dex.com/
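For anyone curious what a "percentage of films" figure means in practice, here is a rough sketch of the idea. It is not the actual code behind the site and does not read the real Subtlex files; the function name `film_percentage` and the toy `films` data are invented for illustration, and it assumes each film's subtitles have already been segmented into a set of words.

```python
def film_percentage(word, films):
    """Return the percentage of films whose word set contains `word`.

    `films` maps a film title to the set of words found in its subtitles.
    Returns None when there are no films to count.
    """
    if not films:
        return None
    hits = sum(1 for words in films.values() if word in words)
    return 100.0 * hits / len(films)

# Hypothetical toy data: film title -> set of segmented words
films = {
    "film_a": {"我", "费用", "人"},
    "film_b": {"人", "费解"},
    "film_c": {"人", "费用"},
    "film_d": {"我", "人"},
}

print(film_percentage("人", films))    # 100.0
print(film_percentage("费用", films))  # 50.0
print(film_percentage("费解", films))  # 25.0
```

A word that turns up a little in many films scores higher here than one that is repeated heavily in a single film, which is a different thing from a raw frequency count.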

 

If you want to use it, you need to sign up to access it at all (not because I want anyone to sign up, but simply because I made it that way and haven't had a need to change it, since I'm the only one who uses it).

markhavemann

I clicked on Roddy's link only after posting, not realising that he already linked to it. Anyway, now it's been mentioned twice! 

anonymoose

You can buy word frequency books in China. I'm not sure how common they are, but I have seen at least one.

 

As others have pointed out, though, relative frequency will depend a lot upon what you are reading. I'm not sure how the books are compiled.

gedawei

Thanks Imron. I may just test it out. I’m at an HSK 5 or 6 level in terms of reading vocabulary. I recently picked up the novel 活着. A tool that would analyze the terms in the novel and help create wordlists seems like it would be helpful. I guess I need to get the digital version though, haha.

 

“The problem is, that once you're at HSK 4 or above, the relative frequencies of each word will be significantly different depending on what you are reading.

 

That means there is no authoritative source, and you're better off calculating the frequencies of words from content that you are reading.  I made a tool that will help you do just that.”
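To make the idea of calculating frequencies from your own reading concrete, here is a very rough sketch. It is not the tool mentioned in the quote. It assumes you have the text as a string, a vocabulary of words to segment with, and a set of words you already know. The names `segment`, `unknown_word_counts`, `vocab` and `known` are all made up for this example, and the greedy longest-match segmentation is a deliberately crude stand-in for a proper segmenter.

```python
from collections import Counter

def segment(text, vocab, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    word found in `vocab`, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

def unknown_word_counts(text, vocab, known):
    """Count the words in `text` that are not in `known`,
    skipping punctuation and whitespace."""
    counts = Counter(w for w in segment(text, vocab) if w.isalpha())
    return Counter({w: n for w, n in counts.items() if w not in known})

# Hypothetical sample: a tiny text, a tiny vocabulary, and a known-word list
text = "他觉得这件事很费解。这件事的费用很高，真让人费解。"
vocab = {"觉得", "这件事", "费解", "费用"}
known = {"他", "觉得", "这件事", "很", "的", "高", "真", "让", "人"}

for word, count in unknown_word_counts(text, vocab, known).most_common():
    print(word, count)
```

Sorting by count puts the unknown words that come up most often in this particular text at the top, which is the order you would probably want to learn them in before reading it.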

imron
7 hours ago, gedawei said:

I guess I need to get the digital version though, haha.

I've recently been using https://www.shutxt.com/ to find digital copies of books, and it seems to be quite good.

 

Here's 活着.  The download links are on the left.
