roddy Posted November 3, 2003 at 04:22 PM
Does anyone know of any lists of word frequency for Chinese? The kind of thing that tells you that 的 is the most common Chinese word, 今天 is the 97th, and 麟 is the 1054th? Note, I'm not looking for character frequency, but word frequency. I guess an HSK vocab list would be a step in the right direction - presumably the words at the lower levels are more frequent.
Roddy
JoH Posted November 3, 2003 at 06:38 PM
Sorry, I don't know of a word frequency list - but do you mean that you already know of a character frequency list (on the web preferably)? If so, I'd be interested to know where. As for the HSK, I guess you are right. But they also seem to be quite selective about the vocab they use. I don't have the list to hand, but I remember that "dentist" wasn't on there (although "doctor" was), for example.
beijingbooty Posted November 3, 2003 at 06:39 PM
No, I have never seen one. I think you would be lucky to find that. It is too complex to put together and would be based too much on people's own opinions and language styles rather than researchable evidence. As you say, there are many character frequency lists. As long as you can master 90% of the contents of an HSK dictionary, you will be fine.
roddy Posted November 4, 2003 at 12:34 AM
JoH, go to Zhongwen.com and click on Character frequency under Vocabulary - it's the best I know of. I got some clues here - but it looks like most of the stuff is for characters only and the HSK word lists are still the best bet (unless you want to do something daft like actually pay for a book).
Roddy
baisong Posted March 4, 2010 at 06:46 AM
Here's something: http://sourceforge.net/projects/libtabe/ I haven't tried it myself, but it says it has word frequency (not character frequency).
piasano Posted March 7, 2010 at 11:21 AM
Wow, libtabe looks very useful!
querido Posted March 7, 2010 at 03:20 PM
Here is some related stuff. His CScanner and CWFC (Chinese Word Frequency Counter) use libtabe. For example, see this output.
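For anyone wondering what a word frequency counter boils down to, here is a minimal sketch in Python. It does not use libtabe or CWFC and is not how either tool is implemented. It assumes the text has already been segmented into words separated by spaces, and the names `word_frequencies` and `ranked` are made up for illustration.

```python
from collections import Counter


def word_frequencies(segmented_text):
    """Count words in text whose words are already separated by whitespace."""
    return Counter(segmented_text.split())


def ranked(counts):
    """Return (rank, word, count) tuples, most frequent word first."""
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


# Hypothetical pre-segmented sample; a real list would need a large corpus.
sample = "我 今天 看 的 书 是 你 的 书"
counts = word_frequencies(sample)

for rank, word, count in ranked(counts)[:3]:
    print(rank, word, count)
```

The hard part in practice is the segmentation step that this sketch skips, which is what a tool like libtabe is for.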
Cactus543 Posted March 7, 2010 at 07:46 PM
Like Roddy said, http://www.zhongwen.com has a pretty good frequency list.
querido Posted March 7, 2010 at 09:25 PM
They're looking for word frequency.
tooironic Posted March 8, 2010 at 12:40 AM
Wenlin has one. 的 is #1, 一起 is in the middle and 野 is at the end.
mihobu Posted April 13, 2010 at 06:29 PM
Wenlin's word list seems to stop at 1,000 words. Not very comprehensive.
c_redman Posted April 14, 2010 at 08:08 PM
Some corpus-derived data from the University of Leeds:
* A collection of Chinese corpora has links to lists from the Lancaster Corpus (top 5000 words) and a home-grown web corpus (top 50k words); and
* Large Corpora used in CTS links to a list from the Chinese Gigaword corpus (top 25k words)
tooironic Posted April 15, 2010 at 12:14 AM
Wow, the corpora search engine in that first link is pretty cool. That will certainly come in handy some day if I want to search for real-life sentence examples without stuffing around on Google. Cheers!
buzhongren Posted April 15, 2010 at 02:37 PM
tooironic: "Wow, the corpora search engine in that first link is pretty cool."
I got the English examples to work. I tried some pinyin, which worked. Is it possible to use Chinese characters in a corpus search for anything other than simple words or characters? If so, how about an example? I always wished Google would implement something like this.
xiele, Jim
tooironic Posted April 15, 2010 at 11:20 PM
What do you mean? You can search by hanzi. E.g.
c_redman Posted April 16, 2010 at 05:47 PM
Maybe the issue is that you need to put word breaks in the search terms. Unfortunately, it doesn't find a match if the words are not segmented the same way as the corpus. The corpus is segmented programmatically, so it may not be 100% perfect, either.
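To illustrate why the segmentation has to agree, here is a small Python sketch. It is not the Leeds search engine's code. It assumes the corpus is stored as a list of word tokens and that a query matches only when its tokens line up exactly with the corpus tokens; `find_sequence` and the sample sentence are made up for illustration.

```python
def find_sequence(tokens, query):
    """Return the start indexes where the query token sequence occurs in tokens."""
    n = len(query)
    if n == 0:
        return []
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == query]


# Hypothetical corpus sentence, segmented the way the corpus tool chose.
corpus_tokens = ["我", "喜欢", "学习", "中文"]

# Query segmented the same way as the corpus: found at index 2.
same_segmentation = find_sequence(corpus_tokens, ["学习", "中文"])

# Same characters, but 学习 split into two tokens: no match.
different_segmentation = find_sequence(corpus_tokens, ["学", "习", "中文"])

print(same_segmentation, different_segmentation)
```

The characters are identical in both queries; only the word breaks differ, and that alone is enough to lose the match.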
c_redman Posted April 16, 2010 at 08:30 PM
Two additional lists, from A Corpus Worker's Toolkit (http://www.humnet.ucla.edu/alc/chinese/ACWT/ACWT.htm). Within the software distribution, there are two files: ldc.dic has 44,000 unique words with frequencies, from a corpus of 4.9 million words; and wordlist.txt (no frequency data).
tooironic Posted April 17, 2010 at 01:27 AM
Hmm. I guess my question would be: what possible pedagogical uses do these corpora have for the average student?
buzhongren Posted April 17, 2010 at 03:47 PM
tooironic: "What do you mean? You can search by hanzi. E.g."
For example, I can't figure out the corpora syntax for conjunctions like 不但.*而且
xiele, Jim
c_redman Posted April 19, 2010 at 01:28 PM
Yes, the syntax is a little quirky; much of the example syntax doesn't work. But I managed to get something out this way:
不但 . . . . . . . . . . . . 而且
That will match with a gap of up to 12 words.
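For comparison, the same kind of gapped search can be sketched in Python over a segmented sentence. This is not the corpus engine's query syntax or implementation. It assumes the text is a list of word tokens and takes the "gap of up to 12 words" description at face value; `find_with_gap` and the sample sentence are made up for illustration.

```python
def find_with_gap(tokens, first, second, max_gap=12):
    """Return (i, j) index pairs where tokens[i] == first, tokens[j] == second,
    and at most max_gap tokens lie between them (nearest match only)."""
    matches = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        # j - i - 1 is the gap size, so j may go up to i + max_gap + 1.
        last = min(i + max_gap + 2, len(tokens))
        for j in range(i + 1, last):
            if tokens[j] == second:
                matches.append((i, j))
                break
    return matches


# Hypothetical segmented sentence.
sentence = ["他", "不但", "会", "说", "中文", "，", "而且", "会", "写", "汉字"]

wide = find_with_gap(sentence, "不但", "而且")               # gap of 4 fits within 12
narrow = find_with_gap(sentence, "不但", "而且", max_gap=2)  # gap of 4 is too wide

print(wide, narrow)
```

Working on tokens rather than raw characters is what makes the limit a count of words, as in the query above, instead of a count of characters as a plain regex like 不但.*而且 would give.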