
Spreadsheet of 10,000 Most Frequent Chinese Words (2397 Characters)


sparrow

Edit: Please read reply #3 by alanmd on this topic. The Wikipedia frequency list is apparently not what it claims to be and repeats words inappropriately.

Edit: Uploaded a corrected copy of the spreadsheet. There was a small error.

 

Using spreadsheet formulas, I was able to pull apart the Mandarin word frequency list found on Wikipedia.

 

Wikipedia Source

PDF File Discussing Methodology (Chen, Tseng, et al.)

 

According to the above PDF, the list comes from a 14-million-character corpus of Chinese newspapers dating from 1993 or earlier.

 

Attached is the spreadsheet.

 

It contains Simplified, Traditional, Pinyin, and English. The comment in the top-left-most cell contains the RAND() formula, which can be used for sorting groups of characters randomly, essentially shuffling them. They can be put back in order by sorting by entry number.
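For readers working outside a spreadsheet, the same shuffle-and-restore idea can be sketched in Python. This is only an illustration: the row layout, the `entry` field name, and both function names are assumptions, not part of the attached file.

```python
import random

def shuffle_block(rows, start, end, seed=None):
    """Return a copy of rows with rows[start:end] shuffled.

    Mirrors the spreadsheet trick: give each row a random key
    (like a RAND() column) and sort the block by that key.
    """
    rng = random.Random(seed)
    keyed = [(rng.random(), row) for row in rows[start:end]]
    keyed.sort(key=lambda pair: pair[0])
    return rows[:start] + [row for _, row in keyed] + rows[end:]

def restore_order(rows):
    """Put rows back in frequency order by sorting on the entry number."""
    return sorted(rows, key=lambda row: row["entry"])

# Hypothetical rows; the real spreadsheet also has Traditional, Pinyin and English columns.
words = ["一", "在", "有", "个", "我", "不"]
rows = [{"entry": i, "word": w} for i, w in enumerate(words, start=1)]

shuffled = shuffle_block(rows, 0, 4, seed=1)  # shuffle only the first four entries
restored = restore_order(shuffled)            # back to the original order
```

Shuffling a block at a time keeps each study group at roughly the same frequency level while removing the fixed order.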

 

If people want info on how I personally use this kind of list, let me know and I'll do a write-up.

 

Statistics

Word Set	Characters in Set
0001–2500	1119
0001–5000	1658
0001–7500	2048
0001–10,000	2397
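Counts like these can be reproduced from any ordered word list. The sketch below is one possible way to do it, not the formulas used in the spreadsheet; the function name and the sample words are made up for illustration.

```python
def cumulative_char_counts(words, checkpoints):
    """Map each checkpoint N to the number of distinct characters in the first N words."""
    seen = set()
    targets = set(checkpoints)
    counts = {}
    for position, word in enumerate(words, start=1):
        seen.update(word)  # a str iterates character by character
        if position in targets:
            counts[position] = len(seen)
    return counts

# Tiny made-up sample, in frequency order.
sample = ["我们", "一个", "我", "中国", "国家"]
print(cumulative_char_counts(sample, [2, 5]))  # {2: 4, 5: 7}
```

With the real list, the checkpoints would be 2500, 5000, 7500 and 10000. Note that this counts every character in a word, so any punctuation or Latin letters in the list would need to be filtered out first.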

Mandarin_10000_Word_Frequency_List.xls

戴 睿

I would love to read a bit more about how you implement this sort of list into your language study. I look forward to reading a write up of that nature!

alanmd

The Wikipedia page and the Chen et al. (1993) paper you linked to have different frequency orderings. I'm not sure that I believe that '的' is only the 28th most frequent word in modern Chinese, as the Wikipedia list and the Excel file show, or that '了' is the 25th most frequent, as the Chen et al. (1993) paper shows. The source texts must have been quite strange to give those frequency results.

 

An interesting feature of the list on Wikipedia is that homographs are broken out into multiple entries, which implies that the meaning of each word was known when the list was compiled. This would be very hard (though not impossible) for a computer to do, and the 1993 paper only talks about word segmentation, not homographs, so its method wouldn't have been able to generate the Wikipedia list. It would be interesting to know how this list was created. There seem to be some errors, though: entries 758 and 759 both show 推出 with the same meaning, and a quick count in Excel found 1189 totally duplicate entries.

 

I usually use the SUBTLEX-CH word frequency lists (Cai & Brysbaert, 2010), as I like their methodology of using subtitles to better measure word usage in modern speech. I don't really use these for studying, except to frequency order all of the words and characters in the Chinese language scripts I write, e.g. http://hskhsk.pythonanywhere.com/radicals?hsk=16 . Maybe someday I'll get around to doing an in-depth comparison of these frequency lists, and try to see which words differ most between them- might help to get a better idea of the advantages/disadvantages of each.

 

For comparison the first few words of each are:

  • Wikipedia: 一,在,有,个,我,不,这,了,。。。
  • Chen et al. (1993): 的,一,在,十,是,有,二,三,。。。
  • SUBTLEX-CH: 的,我,你,是,了,不,在,他,。。。
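A rough sketch of how such a comparison between two lists might be automated, using only the eight words quoted above for the Wikipedia and SUBTLEX-CH lists. The function name is hypothetical, and a real comparison would need the full lists.

```python
def rank_gaps(list_a, list_b):
    """For words present in both lists, return (word, rank_b - rank_a),
    largest absolute gap first."""
    rank_a = {word: i for i, word in enumerate(list_a, start=1)}
    rank_b = {word: i for i, word in enumerate(list_b, start=1)}
    shared = [w for w in list_a if w in rank_b]
    gaps = [(w, rank_b[w] - rank_a[w]) for w in shared]
    # Ties on gap size are broken by position in list_a.
    return sorted(gaps, key=lambda pair: (-abs(pair[1]), rank_a[pair[0]]))

wikipedia = ["一", "在", "有", "个", "我", "不", "这", "了"]
subtlex = ["的", "我", "你", "是", "了", "不", "在", "他"]

print(rank_gaps(wikipedia, subtlex))
# [('在', 5), ('我', -3), ('了', -3), ('不', 0)]
```

A positive gap means the word sits lower in the second list. Words that appear in only one list are skipped here, although on full lists they are often the most interesting cases.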
sparrow

@alanmd:

Interesting—if the list did not come from Chen et al., I wonder where it came from!

 

I noticed there are many repeat entries, and I noticed 的 is quite far down the list, which made me suspicious, so I've continued to use Routledge's frequency dictionary.

 

To be clear, did you find 1189 duplicate entries for the same word? If so, then this list of 10,000 might be rubbish. In that case, I will make a bold note in the OP about this so that people know they're getting a low-quality list and probably should look elsewhere.

alanmd

I created a new column which was Simplified & Trad & pinyin & definition. I then sorted on that column, and made a column that counted consecutively identical rows as a '1', otherwise '0'. The 1189 count would have counted once-duplicated entries as '1', and twice-duplicated as '2', etc. I didn't test for the number of unique duplicates, if you know what I mean! :)
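The counting method described above, and the "unique duplicates" figure that was not tested, can both be sketched in Python. The sample rows and their glosses are invented for illustration, and the function names are assumptions.

```python
from collections import Counter

def count_repeat_rows(rows):
    """Sort the combined rows and count each row identical to the one before it.

    An entry that appears twice adds 1; one that appears three times adds 2.
    """
    keys = sorted(tuple(row) for row in rows)
    return sum(1 for prev, cur in zip(keys, keys[1:]) if prev == cur)

def count_duplicated_entries(rows):
    """Count how many distinct entries appear more than once."""
    tally = Counter(tuple(row) for row in rows)
    return sum(1 for n in tally.values() if n > 1)

# Made-up rows: (Simplified, Traditional, pinyin, definition).
rows = [
    ("推出", "推出", "tuīchū", "to release"),
    ("推出", "推出", "tuīchū", "to release"),
    ("了", "了", "le", "(particle)"),
    ("了", "了", "le", "(particle)"),
    ("了", "了", "le", "(particle)"),
    ("的", "的", "de", "(particle)"),
]

print(count_repeat_rows(rows))         # 3
print(count_duplicated_entries(rows))  # 2
```

The first figure corresponds to the 1189 count; the second would say how many different entries are affected.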

imron

Frequency lists depend completely on the source material that they came from.  If two different data sources were used to collect frequency information, then it's understandable that different characters will appear in different positions.  It's not an exact or precise figure.

alanmd

Yes, they do depend on the source material, but they also depend on the way that words are defined. For example, some frequency lists consider 我们 to be one word and 一个 to be one word, while others consider each to be two words. Even worse, they often use imperfect algorithms to determine word breaks. There can also be errors in the way they are constructed (like the duplicates I noticed above), and as an added complication the list mentioned above attempted to give different entries for homographs, so it's likely they were using some sort of AI or statistical technique to determine which 了 was being used in a given sentence, etc.
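A small sketch of the segmentation point: the same two sentences, hand-segmented two ways, give different frequency tables. The sentences and both segmentations are invented for illustration and do not come from any of the corpora discussed.

```python
from collections import Counter

def word_frequencies(segmented_sentences):
    """Count tokens across sentences that have already been split into words."""
    counts = Counter()
    for sentence in segmented_sentences:
        counts.update(sentence)
    return counts

# "我们有一个问题" and "我们不去", segmented coarsely and finely.
coarse = [["我们", "有", "一个", "问题"], ["我们", "不", "去"]]
fine = [["我", "们", "有", "一", "个", "问题"], ["我", "们", "不", "去"]]

coarse_counts = word_frequencies(coarse)
fine_counts = word_frequencies(fine)

print(coarse_counts["我们"], coarse_counts["我"])  # 2 0
print(fine_counts["我们"], fine_counts["我"])      # 0 2
```

Under the coarse scheme 我们 is the top word and 我 never appears; under the fine scheme it is the other way round, with no change in the underlying text.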

 

When people say "the most common X words in Chinese are ..." as if it is something definitive, they are of course always failing to add the disclaimer "based on corpus Y, using word segmenting algorithm Z, counting words by their appearance in dictionary A, etc."

 

If a frequency list throws some seemingly obscure words as the most frequent, and buries some apparently common words as being less frequent, then it's perfectly valid to question the list- there could be problems with it using a not very representative corpus, or other issues that I mentioned above. 
 
The main reason I was comparing the lists above was to show that the Wikipedia entry and the cited paper didn't match, so they were generated in different ways or based on different corpora.

c_redman

The Wiktionary entry isn't clear at all about where this data came from. In addition, the words have both simplified and traditional forms, but it doesn't say whether the corpus was in traditional or simplified and then converted to the other form. And just one more nitpick: it is called "Mandarin Frequency List" but doesn't specify whether it's words or characters until you click on it.

There is a related conversation at https://www.forumosa.com/taiwan/viewtopic.php?f=40&t=122213. It was suggested there that the duplicates are due to words being counted by part of speech. Someone also suggested the list came from 中央研究院-現代漢語標記語料庫 (Academia Sinica Balanced Corpus of Modern Chinese).

imron
"If a frequency list throws some seemingly obscure words as the most frequent, and buries some apparently common words as being less frequent, then it's perfectly valid to question the list- there could be problems with it using a not very representative corpus, or other issues that I mentioned above."

Yep, I was just trying to point out to people reading the thread that while frequency lists can be useful, they are not absolute.

roddy

Were they Taiwanese papers? There's suspiciously good coverage of Taiwan city names, and I think 网路 is a Taiwanese usage.*

 

Personally, for learners I'd just work from the HSK lists, perhaps going back to the older ones, which covered, I think, almost 9,000 words. You don't get fine-grained frequency information, but I don't see how important that is, and it'll give you more useful vocab: you're 1500 words in before this list covers 名字, and for some reason it misses both 明天 and 昨天, and even 昨日 and 明日.

 

The subtitle corpus is a good idea, but you could get some oddities: 

 

奋斗 corpus: 1 钱; 2 车; 3 房

武林外传 corpus: 1 确定; 2 一定; 3 肯定

 

*Ah, just noticed the Academia Sinica reference, probably was.

alanmd

I agree with roddy #10, the HSK lists introduce words in a very sensible order. They won't be perfect for everyone, but no list is; you need to add words relevant to you as you need or come across them (no way on earth I'm waiting until HSK 6 to learn 串 so that I can order 羊肉串!).

 

Here are all HSK words, grouped by level and ordered within each level by subtitle corpus frequency (which is a pretty good default ordering to learn them in): http://hskhsk.pythonanywhere.com/hskwords . You can hover over words for more info, or click 'expand definitions' for a massive page with info on each word inline. I also have flashcard file versions of these lists on my site.

 

The only problem I've found using the subtitle corpus is that it considers some compounds of two words to be words in their own right. This doesn't matter too much with the way I am using it however.

sparrow

@戴睿长老 #2:

I made a new topic explaining how I use a spreadsheet to study. This is the link.

 

@alanmd #11:

I'm curious: How are you using the SUBTLEX-CH corpus?

alanmd

I am using it to order characters and words by frequency in the scripts that I've written and linked to above. I created, and use, flashcard files of the HSK words ordered by frequency, so I know that within each level I am learning the highest frequency words first. I also made a big wall chart of HSK 1-6 words, using the frequency lists as weights so that the highest frequency words are nearer the middle, with all words on nodes that are coloured by HSK level (it looks very pretty!). On my other HSK charts I make more frequently occurring words at each level slightly darker, to emphasise their importance.
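The level-then-frequency ordering described here can be sketched as a two-part sort key. This is not the code behind the linked site: the function name and the small sample tables are illustrative only, with ranks taken from the SUBTLEX-CH words quoted earlier in the thread.

```python
def order_by_level_then_frequency(hsk_levels, freq_rank):
    """Sort words by HSK level, then by frequency rank (1 = most frequent).

    Words missing from the frequency list go last within their level.
    """
    unranked = float("inf")
    return sorted(
        hsk_levels,
        key=lambda word: (hsk_levels[word], freq_rank.get(word, unranked), word),
    )

# Illustrative data only, not the full HSK or SUBTLEX-CH lists.
hsk_levels = {"他": 1, "我": 1, "的": 1, "名字": 1, "串": 6}
freq_rank = {"的": 1, "我": 2, "他": 8}

print(order_by_level_then_frequency(hsk_levels, freq_rank))
# ['的', '我', '他', '名字', '串']
```

Adding the word itself as the last part of the key keeps the order stable when two words have no rank.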

Daedalus

Thanks Sparrow. Very useful for determining which synonyms I best memorise first. 
