Frequently used chengyu project


I just talked with my girlfriend (Chinese) about it, and she wasn't too convinced. I then did the first 100, which looked more convincing to her. There's a problem with the list, though: it lists traditional and simplified characters separately. So you'd either have to combine them in the program, which is way more work than I care to do, or we'd have to use a list that only has one kind.

Anyhow, I'll attach the first 100 chengyu (please note: this means the first 100 chengyu in the list that tooironic provided, sorted by google results, NOT the top 100 most-used chengyu of the list). If you think it's fine, tell me and I'll run the program over the whole list (or a better one, if anybody cares to provide one).

hmkay, I got something, but before I let it run over all 1600 chengyu, I let it do the first 30. Can someone with more Chinese knowledge than me tell me if they're more or less in the right ballpark concerning relative frequency?

阿狗阿猫 4130000

Did you search for "阿狗阿猫" or for 阿狗阿猫 (with or without quotes)?

When I search for "阿狗阿猫" I have only 30900 hits using google.
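The gap between the two hit counts comes down to whether the phrase is wrapped in double quotes before it is URL-encoded. Here is a minimal sketch in Python 3 (the script posted later in the thread is Python 2). `build_query` is a hypothetical helper name, and nothing is actually sent to a search engine:

```python
from urllib.parse import urlencode, parse_qs

def build_query(phrase, exact=True):
    """Return a URL-encoded query string; exact=True wraps the phrase in double quotes."""
    term = '"%s"' % phrase if exact else phrase
    return urlencode({"q": term})

loose = build_query("阿狗阿猫", exact=False)
exact = build_query("阿狗阿猫")
print(loose)
print(exact)
# Decoding the exact query shows that the only difference is the pair of quotation marks.
print(parse_qs(exact)["q"][0])
```

Without the quotes, the search engine is free to match the characters separately, which would explain the much larger number.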

Very good suggestion, BertR, the numbers have changed considerably:

爱不释手   654000
百感交集   195000
标新立异   188000
蹦蹦跳跳   160000
愛不釋手   156000
按兵不动   127000
杯水车薪   123000
按劳分配   114000
按勞分配   113000
爱面子   113000
背水一战   105000
暴跳如雷   105000
闭门造车   88500
抱佛脚   81900
变化莫测   75100
變化莫測   74300
班门弄斧   73100
安分守己   71500
八九不离十   68500
八九不離十   68400
半推半就   67900
爱莫能助   66900
阿猫阿狗   66000
白驹过隙   62500
半斤八两   62400
報仇雪恨   56600
报仇雪恨   56500
八仙过海,各显神通   42200
八仙過海,各顯神通   42100
百思不解   41500
八千里路云和月   38400
八千里路雲和月   38100
扮猪吃老虎   36800
扮豬吃老虎   36600
抱头鼠窜   34500
安之若素   32400
杯弓蛇影   29300
百口莫辩   29100
阿貓阿狗   27100
愛面子   20800
杯酒釋兵權   18100
杯酒释兵权   18000
背水一戰   17200
別具一格   17200
八竿子打不着   15500
比手划脚   13400
比手劃腳   13400
抱佛腳   13300
標新立異   12600
按兵不動   11900
杯盘狼藉   11400
閉門造車   10800
愛莫能助   10500
半斤八兩   9840
杯水車薪   9470
百廢俱興   8490
百废俱兴   8320
班門弄斧   7840
表里一致   7620
安家立業   7600
表裡一致   7560
安家立业   7510
八竿子打不著   7180
悲歡歲月   6920
悲欢岁月   6880
办事不牢   6700
阿狗阿貓   6630
辦事不牢   6600
阿狗阿猫   6540
豹頭環眼   5720
豹头环眼   5650
白刀子进,红刀子出   4850
白刀子進,紅刀子出   4780
百口莫辯   4490
杯盤狼藉   4260
边都沾不上   3620
邊都沾不上   3440
白駒過隙   3210
抱頭鼠竄   3020
彼一时,此一时   2890
彼一時,此一時   2860
暗暗自责   2460
拜倒石榴裙下   2340
鼻子气歪了   1760
鼻子氣歪了   1740
鼻子不是鼻子脸不是脸   1450
鼻子不是鼻子臉不是臉   1430
包在...身上   1260
百无一二   1200
百無一二   1200
碧眼童颜   584
碧眼童顏   576
彪腹狼腰   490
報讎雪恨   348
报雠雪恨   347
豹头猿臂   295
豹頭猿臂   295
绑赴市曹   242
綁赴市曹   241
暗暗自責   89

Yes, adding quotation marks would be a must!

One note though: so many chengyu have developed through history that just doing the first 100 of that already small list would not be much of an indication of anything. Remember, like I said earlier, that list contains not only chengyu but any idiomatic expression in Mandarin, so you get entries like 愛面子, 包在...身上, etc which are not chengyu. Still, go right ahead and do the entire list if you can! Jiayou~

EDIT: Also I might add that I think the fact that the list has both trad and simp forms might be kind of interesting as it might highlight differences between mainland China and Taiwan, HK, etc.
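For anyone who wants to look at both views at once, here is a rough Python 3 sketch that keeps the two forms side by side, adds a combined total, and shows the simplified share of the hits. The counts are copied from the quoted-search list above. The pairing is typed in by hand here, which is an assumption: a real run would need a proper traditional/simplified conversion table. `summarize` is just an illustrative name.

```python
# (simplified, traditional) -> (simplified count, traditional count),
# copied from the quoted-search list earlier in the thread
pairs = {
    ("爱不释手", "愛不釋手"): (654000, 156000),
    ("按兵不动", "按兵不動"): (127000, 11900),
    ("按劳分配", "按勞分配"): (114000, 113000),
    ("阿猫阿狗", "阿貓阿狗"): (66000, 27100),
}

def summarize(pairs):
    """Return (simp, trad, combined total, simplified share) rows, largest total first."""
    rows = []
    for (simp, trad), (n_simp, n_trad) in pairs.items():
        total = n_simp + n_trad
        rows.append((simp, trad, total, n_simp / total))
    return sorted(rows, key=lambda row: row[2], reverse=True)

for simp, trad, total, share in summarize(pairs):
    print("%s / %s   %d   simplified share %.0f%%" % (simp, trad, total, 100 * share))
```

The share column is only as meaningful as the hit counts themselves, so treat it as a curiosity rather than a measurement.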

tooironic, it was just to see if the program works before I let it run for an hour to get the result for the whole list ;)

Since it seems to give sort-of-sane results now, I'll let it go over the list. I found out, though, that the numbers I get from google are different from the ones you get when you go to google.com. I'll have a look at that. You guys go and find me more and better chengyu lists ;)

I got around that stupid google api. The results are very different now. I'll need somebody to have a look at both lists and tell me which one is more accurate in assessing the frequency of the chengyus.
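Short of a native speaker's judgement, one way to at least put a number on how much two lists disagree is a rank correlation. Below is a minimal Python 3 sketch of Spearman's rho, using a handful of chengyu whose counts appear in both quoted-search lists posted in this thread. `ranks` and `spearman` are hypothetical helper names, and the simple formula used here assumes there are no tied counts.

```python
def ranks(counts):
    """Rank counts from 1 (largest) downward; assumes no ties."""
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    result = [0] * len(counts)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(a, b):
    """Spearman's rho for two equally long lists without ties."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# chengyu -> (count in the first list, count in the google.com list)
counts = {
    "爱不释手": (654000, 6580000),
    "百感交集": (195000, 897000),
    "标新立异": (188000, 935000),
    "蹦蹦跳跳": (160000, 895000),
    "按兵不动": (127000, 2510000),
    "杯水车薪": (123000, 664000),
    "鼻子不是鼻子脸不是脸": (1450, 223000000),
}

first = [pair[0] for pair in counts.values()]
second = [pair[1] for pair in counts.values()]
print("Spearman rho: %.2f" % spearman(first, second))
```

A value near 1 would mean the two sources order the chengyu the same way, while a value near 0 means the orderings are essentially unrelated. It says nothing about which source is right, only how far apart they are.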

Below are the first 100 chengyu of the list, sorted by the estimated search results that google gives on google.com:

Here's the sorted list of the 1626 chengyus from tooironic. It's based on the numbers from the google webapi, not google.com. I'm waiting with that until somebody verifies that those numbers are really much more accurate, because going over google.com not only goes directly against the google terms of service, it also takes up to 10 seconds for each chengyu, so it'll take forever, and I want to know if it's worth it.
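To put "forever" into numbers, a quick back-of-the-envelope calculation in Python 3, taking the 10 seconds per chengyu as the worst case for every single lookup (an assumption, since the post only says "up to"):

```python
chengyu_total = 1626      # size of the list, from the post above
seconds_per_query = 10    # worst case quoted in the post

def worst_case_hours(n_items, seconds_each):
    """Total running time in hours if every lookup takes the full worst-case time."""
    return n_items * seconds_each / 3600.0

print("%.1f hours" % worst_case_hours(chengyu_total, seconds_per_query))
```

So a full pass over google.com would be an afternoon-sized job at worst, not counting any failed lookups that have to be repeated.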


phyrex, still working on the MOE list. It would be great if you could run it for me, but I would also be interested in learning about the code. Xiexie.

tooironic, it looks like a nice list. It shouldn't be too hard to cross-reference it with lists that are already widely available. 1,600 is actually an OK number: provided that the list has really been edited with an eye towards frequency, it would be a good number of core chengyu. Though I doubt that was a concern for the wiktionary editors... I got maybe 2000-3000 by culling everything from CEDICT that is marked "idiom", "proverb", "literary saying" and the like, but had to enter a lot of quite well-known chengyu later.
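The CEDICT culling can be sketched in a few lines of Python 3. This assumes the usual CC-CEDICT line layout (traditional, simplified, pinyin in square brackets, glosses between slashes). The sample lines are typed by hand for illustration rather than copied from the dictionary file, and `ENTRY`, `MARKERS` and `idiom_entries` are made-up names.

```python
import re

# traditional simplified [pinyin] /gloss 1/gloss 2/
ENTRY = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")
MARKERS = ("idiom", "proverb", "saying")

def idiom_entries(lines):
    """Yield (traditional, simplified) for entries whose glosses mention a marker."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        match = ENTRY.match(line)
        if match is None:
            continue  # skip anything that is not in the expected layout
        trad, simp, _pinyin, glosses = match.groups()
        if any(marker in glosses.lower() for marker in MARKERS):
            yield trad, simp

sample = [
    "# hand-written sample lines in CC-CEDICT layout",
    "愛不釋手 爱不释手 [ai4 bu4 shi4 shou3] /to love sth too much to part with it (idiom)/",
    "班門弄斧 班门弄斧 [ban1 men2 nong4 fu3] /to display one's slight skill before an expert (idiom)/",
    "愛面子 爱面子 [ai4 mian4 zi5] /to care about saving face/",
]

print(list(idiom_entries(sample)))
```

A filter like this only catches entries that carry one of the markers, which fits the experience described above: well-known chengyu without a marker slip through and have to be added by hand.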

Very interesting, thanks heaps for your hard work! It makes sense that chengyu like 随时随地, 无论如何, 不可思议, etc would be so prominent as they are very common. It would be great to get some input from native speakers though... :clap

chrix: Yeah, the list from wiktionary is very random, as most people just add whatever entries at the time that interest them. But still, it's a good starting point I think. :)

yeah, also tooironic, the next exciting question would be, what are the best strategies translators use when faced with chengyu? A lot of chengyu just translate into simplex words in English, some need circumlocutions, and some have "English chengyu" counterparts, and in some cases you could use the "the Chinese have a saying" spiel.

The code is ugly but easy.

I'll post it here and attach it for easier playing. Comments and criticism are welcome.

# -*- coding: utf-8 -*-
# Python 2 script: looks up an estimated google hit count for each chengyu
# in a list and writes the list back out sorted by that count.
import json
import urllib
import sys
import codecs
from operator import itemgetter
from xgoogle.search import GoogleSearch, SearchError

# scrape from google.com
# (the caller already wraps the chengyu in quotation marks)
def googleWebSearch(currentChengyu):
    try:
        return GoogleSearch(currentChengyu).num_results
    except SearchError, e:
        print "Google search failed: %s" % e
        return 0  # count a failed lookup as 0 so the run can continue

# use google web api
def googleAPIsearch(currentChengyu):
    # both parameters go into one dict; a second positional argument
    # to urlencode is the doseq flag, not extra parameters
    query = urllib.urlencode({'q': currentChengyu, 'hl': 'cn'})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    est = data['cursor']['estimatedResultCount']
    return int(est)


if len(sys.argv) < 3:
    print "USAGE: chengyu.py listWithChengYus.txt fileNameForSortedChengYus.txt"
    sys.exit(1)

# read chengyu list
chengyulist = codecs.open(sys.argv[1], "r", encoding='utf-8')  # might have to adjust encoding

chengyu_results = {}

# get results
for i, line in enumerate(chengyulist.readlines()):
    chengyu = line.strip().encode("utf-8")
    if len(chengyu) < 1:
        continue  # skip empty lines
    chengyu = '"' + chengyu + '"'  # quotation marks force an exact-phrase search
    chengyu_results[chengyu] = googleWebSearch(chengyu)
    print i, chengyu, chengyu_results[chengyu]
chengyulist.close()

# sort list by frequency
chengyu_sorted = sorted(chengyu_results.items(), key=itemgetter(1), reverse=True)

# write sorted list
output = open(sys.argv[2], "w")
for chengyu in chengyu_sorted:
    output.write(str(chengyu[0]) + "   " + str(chengyu[1]) + "\n")
output.close()

EDIT: whoops, embarrassing. I forgot the quotation marks again for the google websearch! I fixed it in the quote, but not in the uploaded file. Please note this when playing with the code!

the next exciting question would be, what are the best strategies translators use when faced with chengyu?

To be honest, I think that question is out of the scope of this forum, unless they create a sub-forum dedicated to translation studies and professional translators were around to answer such questions. I've tried to request one, but it seems the demand is not there. Still, you could create a new topic and see the answers you get.

well, one question does not an entire subforum make :mrgreen: Is there some literature available on this? I'd like to learn more about it, is all....

roddy was discussing the future of the forums a day or two ago, maybe you can reiterate your request. I'd be happy to second you....

here are the 'first 100' google.com results, this time WITH quotation marks. Now it looks a bit uglier, but that could be easily fixed if needed.

"鼻子不是鼻子脸不是脸"   223000000
"鼻子不是鼻子臉不是臉"   119000000
"阿狗阿猫"   76300000
"阿狗阿貓"   75900000
"百廢俱興"   12700000
"百废俱兴"   11600000
"包在...身上"   9960000
"比手划脚"   8530000
"百無一二"   7260000
"百无一二"   7250000
"八千里路雲和月"   7200000
"边都沾不上"   6710000
"愛不釋手"   6590000
"爱不释手"   6580000
"八九不离十"   5540000
"彼一時,此一時"   4250000
"邊都沾不上"   3730000
"爱莫能助"   3460000
"彪腹狼腰"   3050000
"按勞分配"   3040000
"表里一致"   2920000
"表裡一致"   2900000
"按兵不动"   2510000
"爱面子"   2430000
"鼻子气歪了"   2370000
"变化莫测"   2230000
"變化莫測"   2220000
"闭门造车"   1830000
"閉門造車"   1830000
"別具一格"   1760000
"阿猫阿狗"   1730000
"拜倒石榴裙下"   1520000
"班門弄斧"   1510000
"鼻子氣歪了"   1380000
"按劳分配"   1100000
"比手劃腳"   985000
"標新立異"   935000
"标新立异"   935000
"八千里路云和月"   917000
"百感交集"   897000
"蹦蹦跳跳"   895000
"八九不離十"   761000
"愛面子"   729000
"阿貓阿狗"   720000
"按兵不動"   710000
"办事不牢"   698000
"辦事不牢"   697000
"抱頭鼠竄"   690000
"杯水車薪"   666000
"杯水车薪"   664000
"背水一战"   624000
"暴跳如雷"   622000
"背水一戰"   619000
"彼一时,此一时"   578000
"杯弓蛇影"   506000
"抱佛腳"   486000
"抱佛脚"   485000
"扮豬吃老虎"   456000
"豹头环眼"   436000
"白刀子進,紅刀子出"   425000
"愛莫能助"   420000
"白刀子进,红刀子出"   414000
"安分守己"   397000
"半斤八两"   393000
"半斤八兩"   391000
"班门弄斧"   382000
"暗暗自責"   350000
"報仇雪恨"   348000
"半推半就"   346000
"暗暗自责"   333000
"杯酒釋兵權"   329000
"报仇雪恨"   326000
"白駒過隙"   311000
"白驹过隙"   310000
"扮猪吃老虎"   231000
"百思不解"   224000
"抱头鼠窜"   222000
"八仙过海,各显神通"   213000
"八仙過海,各顯神通"   212000
"百口莫辩"   184000
"百口莫辯"   183000
"安之若素"   177000
"悲欢岁月"   166000
"悲歡歲月"   166000
"八竿子打不著"   141000
"豹頭環眼"   126000
"綁赴市曹"   105000
"八竿子打不着"   92100
"杯酒释兵权"   85000
"杯盤狼藉"   84900
"杯盘狼藉"   84700
"碧眼童颜"   72300
"碧眼童顏"   72300
"绑赴市曹"   68500
"豹頭猿臂"   68200
"豹头猿臂"   67300
"安家立業"   51300
"報讎雪恨"   49800
"安家立业"   48400
"报雠雪恨"   46000

EDIT: turns out, those numbers are not what google actually says. To get what google actually says, one MUST NOT add the "". *sigh*. Can't anybody settle on a standard? :(

well, one question does not an entire subforum make. Is there some literature available on this? I'd like to learn more about it, is all....

Hehe, and neither does just a handful of interested people. Honestly, the more I think about it, the more I realise that an entire subforum wouldn't really be appropriate here. From what I can tell, there are only about half a dozen professional translators who are active posters here, and, at any rate, that does not guarantee that they would be willing to discuss anything beyond language issues, which are adequately covered in the other forums. But I'd be happy to discuss it further with you via email or MSN. I'll PM you. :)

tooironic, well I know there's the fanyi mailing-list, but if you have half a dozen translators on this forum, that's already a bit more than those interested in Classical Chinese. But we have been quite active and it seems that roddy is gonna give us a subforum :clap . So it's not so much about the number of people but more about constant activity over a period of time...

