Chinese-Forums

Frequently used chengyu project


chrix

I just talked with my girlfriend (Chinese) about it, and she wasn't too convinced. I then did the first 100, which looked more convincing to her. There's a problem with the list though: it lists traditional and simplified characters separately. So you'd either have to combine them in the program, which is way more work than I care to do, or we'd have to use a list which only lists one kind.

Anyhow, I'll attach the first 100 chengyu (please note: this means the first 100 chengyu in the list that tooironic provided, sorted by Google results, NOT the 100 most-used chengyu of the list). If you think it's fine, tell me, and then I'll run the program over the whole list (or a better one, if anybody cares to provide one).

鼻子不是鼻子脸不是脸   4340000
包在...身上   4310000
阿狗阿猫   4130000
阿狗阿貓   4100000
表里一致   1320000
表裡一致   1310000
百无一二   1210000
百無一二   1210000
鼻子不是鼻子臉不是臉   1130000
边都沾不上   1040000
邊都沾不上   940000
爱不释手   779000
愛不釋手   771000
比手划脚   461000
比手劃腳   459000
变化莫测   417000
變化莫測   412000
八千里路云和月   395000
八千里路雲和月   389000
別具一格   387000
办事不牢   313000
辦事不牢   310000
八九不离十   299000
按劳分配   270000
按勞分配   269000
彼一时,此一时   230000
彼一時,此一時   230000
標新立異   198000
标新立异   198000
百感交集   186000
蹦蹦跳跳   166000
按兵不动   136000
按兵不動   135000
愛面子   131000
爱面子   131000
杯水车薪   129000
杯水車薪   129000
鼻子气歪了   120000
背水一战   120000
背水一戰   118000
暴跳如雷   104000
閉門造車   98700
闭门造车   98600
抱佛腳   94800
抱佛脚   94500
阿猫阿狗   93000
阿貓阿狗   90600
八九不離十   87700
鼻子氣歪了   86300
班門弄斧   81200
班门弄斧   80800
愛莫能助   77500
爱莫能助   77500
悲歡歲月   73000
悲欢岁月   72600
半斤八两   72400
半斤八兩   71700
百廢俱興   71600
安分守己   71500
百废俱兴   70300
半推半就   67900
白駒過隙   65600
白驹过隙   65500
豹頭環眼   55100
报仇雪恨   54100
報仇雪恨   54000
暗暗自责   53700
豹头环眼   52500
白刀子进,红刀子出   50300
白刀子進,紅刀子出   49900
扮猪吃老虎   48200
扮豬吃老虎   47800
八仙过海,各显神通   44400
八仙過海,各顯神通   43700
百思不解   41600
八竿子打不著   38000
抱头鼠窜   37400
抱頭鼠竄   37300
百口莫辩   33700
百口莫辯   33500
安之若素   32300
拜倒石榴裙下   31100
杯弓蛇影   29300
八竿子打不着   19000
杯酒釋兵權   17900
杯酒释兵权   17800
綁赴市曹   16300
绑赴市曹   16300
杯盤狼藉   15700
杯盘狼藉   15500
安家立業   9960
安家立业   9900
彪腹狼腰   4790
碧眼童颜   2280
碧眼童顏   2280
暗暗自責   1640
豹头猿臂   1480
豹頭猿臂   1470
报雠雪恨   799
報讎雪恨   793
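
One way to sidestep the duplicate simplified/traditional rows is to fold both forms onto one key before ranking. The sketch below is Python 3 and assumes a tiny hand-written traditional-to-simplified table, `TRAD_TO_SIMP`, that covers only the sample rows; a real run would need a full conversion table. Summing the two counts is just one option, and `merge_variants` is a name made up for this sketch.

```python
from collections import defaultdict

# Hypothetical, hand-written table: covers only the characters in the sample below.
TRAD_TO_SIMP = str.maketrans({"愛": "爱", "釋": "释", "動": "动"})

def merge_variants(counts):
    """Sum the hit counts of entries that differ only by script variant."""
    merged = defaultdict(int)
    for phrase, hits in counts.items():
        merged[phrase.translate(TRAD_TO_SIMP)] += hits
    return dict(merged)

# Four rows copied from the list above.
sample = {
    "爱不释手": 779000,
    "愛不釋手": 771000,
    "按兵不动": 136000,
    "按兵不動": 135000,
}

merged = merge_variants(sample)
for phrase, hits in sorted(merged.items(), key=lambda kv: kv[1], reverse=True):
    print(phrase, hits)
```

Keeping the two forms apart instead would preserve any regional difference in usage, so the merge is best treated as an optional post-processing step.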

Link to comment
Share on other sites

hmkay, I got something, but before I let it run over all 1600 chengyu, I let it do the first 30. Can someone with more Chinese knowledge than me tell me if they're more or less in the right ballpark, concerning relative frequency?

阿狗阿猫 4130000

Did you search for "阿狗阿猫" or for 阿狗阿猫 (with or without quotes)?

When I search for "阿狗阿猫" I have only 30900 hits using google.
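
The difference described here comes down to whether the double quotes are part of the query string that is sent to the search engine. A minimal Python 3 sketch of building both variants follows; `build_query` is a made-up helper name, and only a `q` parameter is assumed.

```python
from urllib.parse import urlencode, parse_qs

def build_query(phrase, exact=True):
    """Return a URL query string; wrap the phrase in double quotes for an exact-phrase search."""
    term = '"%s"' % phrase if exact else phrase
    return urlencode({"q": term})

exact_query = build_query("阿狗阿猫")
loose_query = build_query("阿狗阿猫", exact=False)

# Decoding the query again shows what the search engine would receive.
print(parse_qs(exact_query)["q"][0])
print(parse_qs(loose_query)["q"][0])
```

Without the quotes, a search engine is free to match the characters separately, which would explain a much larger hit count.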

Link to comment
Share on other sites

Very good suggestion, BertR, the numbers have changed considerably:

爱不释手   654000
百感交集   195000
标新立异   188000
蹦蹦跳跳   160000
愛不釋手   156000
按兵不动   127000
杯水车薪   123000
按劳分配   114000
按勞分配   113000
爱面子   113000
背水一战   105000
暴跳如雷   105000
闭门造车   88500
抱佛脚   81900
变化莫测   75100
變化莫測   74300
班门弄斧   73100
安分守己   71500
八九不离十   68500
八九不離十   68400
半推半就   67900
爱莫能助   66900
阿猫阿狗   66000
白驹过隙   62500
半斤八两   62400
報仇雪恨   56600
报仇雪恨   56500
八仙过海,各显神通   42200
八仙過海,各顯神通   42100
百思不解   41500
八千里路云和月   38400
八千里路雲和月   38100
扮猪吃老虎   36800
扮豬吃老虎   36600
抱头鼠窜   34500
安之若素   32400
杯弓蛇影   29300
百口莫辩   29100
阿貓阿狗   27100
愛面子   20800
杯酒釋兵權   18100
杯酒释兵权   18000
背水一戰   17200
別具一格   17200
八竿子打不着   15500
比手划脚   13400
比手劃腳   13400
抱佛腳   13300
標新立異   12600
按兵不動   11900
杯盘狼藉   11400
閉門造車   10800
愛莫能助   10500
半斤八兩   9840
杯水車薪   9470
百廢俱興   8490
百废俱兴   8320
班門弄斧   7840
表里一致   7620
安家立業   7600
表裡一致   7560
安家立业   7510
八竿子打不著   7180
悲歡歲月   6920
悲欢岁月   6880
办事不牢   6700
阿狗阿貓   6630
辦事不牢   6600
阿狗阿猫   6540
豹頭環眼   5720
豹头环眼   5650
白刀子进,红刀子出   4850
白刀子進,紅刀子出   4780
百口莫辯   4490
杯盤狼藉   4260
边都沾不上   3620
邊都沾不上   3440
白駒過隙   3210
抱頭鼠竄   3020
彼一时,此一时   2890
彼一時,此一時   2860
暗暗自责   2460
拜倒石榴裙下   2340
鼻子气歪了   1760
鼻子氣歪了   1740
鼻子不是鼻子脸不是脸   1450
鼻子不是鼻子臉不是臉   1430
包在...身上   1260
百无一二   1200
百無一二   1200
碧眼童颜   584
碧眼童顏   576
彪腹狼腰   490
報讎雪恨   348
报雠雪恨   347
豹头猿臂   295
豹頭猿臂   295
绑赴市曹   242
綁赴市曹   241
暗暗自責   89

Link to comment
Share on other sites

Yes, adding quotation marks would be a must!

One note though: so many chengyu have developed through history that just doing the first 100 of that already small list would not be much of an indication of anything. Remember, like I said earlier, that list contains not only chengyu but any idiomatic expression in Mandarin, so you get entries like 愛面子, 包在...身上, etc which are not chengyu. Still, go right ahead and do the entire list if you can! Jiayou~

EDIT: Also I might add that I think the fact that the list has both trad and simp forms might be kind of interesting as it might highlight differences between mainland China and Taiwan, HK, etc.
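
To make that comparison concrete, one could look at the ratio of simplified to traditional hits per expression. A rough Python 3 sketch, using three pairs copied from the quoted-search list above; the pairing is done by hand here, and the ratio reflects indexed web pages, not necessarily how speakers in each region use the expression.

```python
# (simplified, simplified hits, traditional, traditional hits), copied from the list above
pairs = [
    ("爱不释手", 654000, "愛不釋手", 156000),
    ("按劳分配", 114000, "按勞分配", 113000),
    ("阿猫阿狗", 66000, "阿貓阿狗", 27100),
]

def simp_to_trad_ratio(simp_hits, trad_hits):
    """A ratio above 1 means the simplified form has more hits."""
    return simp_hits / trad_hits

ratios = {simp: simp_to_trad_ratio(s, t) for simp, s, trad, t in pairs}
for simp, ratio in ratios.items():
    print(simp, round(ratio, 2))
```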

Link to comment
Share on other sites

tooironic, it was just to see if the program works before I let it run for an hour to get the result for the whole list ;)

Since it seems to return sort-of-sane results now, I'll let it go over the list. I found out though that the numbers I get from the Google API are different from the ones you get when you go to google.com. I'll have a look at that. You guys go and find me more and better chengyu lists ;)

Link to comment
Share on other sites

I got around that stupid Google API. The results are very different now. I'll need somebody to have a look at both lists and tell me which one is more accurate in assessing the frequency of the chengyus.

Below are the first 100 chengyu of the list, sorted by the estimated search results that Google gives on google.com:

百無一二   5970000
百无一二   5440000
辦事不牢   4050000
办事不牢   3870000
豹頭環眼   3590000
豹头环眼   3450000
暗暗自责   3340000
百廢俱興   2940000
表裡一致   2840000
百废俱兴   2810000
爱不释手   2810000
边都沾不上   2740000
邊都沾不上   2730000
表里一致   2500000
八竿子打不著   2320000
鼻子氣歪了   1890000
鼻子气歪了   1860000
彼一時,此一時   1390000
八九不離十   1270000
彼一时,此一时   1260000
包在...身上   1160000
百感交集   915000
蹦蹦跳跳   882000
标新立异   870000
拜倒石榴裙下   866000
愛不釋手   832000
按劳分配   775000
按勞分配   774000
按兵不动   631000
暴跳如雷   623000
杯水车薪   608000
爱面子   594000
背水一战   528000
阿貓阿狗   435000
闭门造车   424000
抱佛脚   411000
安分守己   397000
变化莫测   396000
變化莫測   395000
八九不离十   371000
爱莫能助   356000
半推半就   345000
班门弄斧   344000
報仇雪恨   331000
报仇雪恨   331000
半斤八两   330000
白驹过隙   293000
碧眼童颜   290000
阿猫阿狗   290000
碧眼童顏   290000
豹頭猿臂   262000
豹头猿臂   250000
百思不解   224000
白刀子進,紅刀子出   212000
扮豬吃老虎   207000
扮猪吃老虎   207000
抱头鼠窜   205000
八仙过海,各显神通   196000
八仙過海,各顯神通   195000
八千里路云和月   189000
白刀子进,红刀子出   189000
八千里路雲和月   188000
安之若素   177000
杯弓蛇影   162000
報讎雪恨   158000
报雠雪恨   158000
百口莫辩   154000
綁赴市曹   148000
愛面子   142000
绑赴市曹   132000
抱頭鼠竄   118000
杯盤狼藉   117000
白駒過隙   116000
背水一戰   99400
別具一格   95000
杯酒釋兵權   85500
杯酒释兵权   85500
按兵不動   85200
八竿子打不着   84900
抱佛腳   75200
比手划脚   73400
比手劃腳   73300
標新立異   70600
杯盘狼藉   64300
愛莫能助   63100
杯水車薪   62900
半斤八兩   62100
鼻子不是鼻子臉不是臉   58300
閉門造車   57700
班門弄斧   38200
安家立業   36400
安家立业   36300
阿狗阿貓   31700
阿狗阿猫   31500
悲欢岁月   30200
悲歡歲月   30200
百口莫辯   29700
鼻子不是鼻子脸不是脸   24500
彪腹狼腰   16900
暗暗自責   10300
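
One rough way to quantify how much the two sources disagree is a rank correlation over the expressions they share. The Python 3 sketch below computes Spearman's rho by hand for six simplified entries taken from the quoted-API list and the google.com list above. It assumes no tied counts, and it only measures agreement between the two lists, not which one is closer to real usage.

```python
def ranks(values):
    """Rank values from largest (1) to smallest; assumes no ties."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# phrase: (web API count, google.com count), copied from the two lists above
counts = {
    "爱不释手": (654000, 2810000),
    "百感交集": (195000, 915000),
    "标新立异": (188000, 870000),
    "蹦蹦跳跳": (160000, 882000),
    "办事不牢": (6700, 3870000),
    "百无一二": (1200, 5440000),
}

api = [c[0] for c in counts.values()]
web = [c[1] for c in counts.values()]
rho = spearman_rho(api, web)
print(round(rho, 3))
```

A value near 1 would mean the two sources order the expressions the same way; a value near 0 or below means the orderings have little in common, which is a hint that at least one source is unreliable for this purpose.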

Link to comment
Share on other sites

Here's the sorted list of the 1626 chengyus from tooironic. It's based on the numbers from the google webapi, not google.com. I'm waiting with that until somebody verifies that those numbers are really much more accurate, because going over google.com not only goes directly against the google terms of service, it also takes up to 10 seconds for each chengyu, so it'll take forever, and I want to know if it's worth it.

chengyu_sorted.txt
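
For a sense of scale, the worst case quoted above works out as follows. A small Python 3 calculation, taking 10 seconds per lookup as the upper bound mentioned and 1626 entries:

```python
ENTRIES = 1626           # size of the list mentioned above
SECONDS_PER_LOOKUP = 10  # upper bound quoted above

total_seconds = ENTRIES * SECONDS_PER_LOOKUP
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print("%d lookups: about %d h %d min" % (ENTRIES, hours, minutes))
```

So a full scrape is a matter of hours rather than days, as long as every lookup goes through.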

Link to comment
Share on other sites

phyrex, still working on the MOE list. It would be great if you could run it for me, but I would also be interested in learning about the code. Xiexie.

tooironic, it looks like a nice list. It shouldn't be too hard to cross-reference it with lists that are already widely available. 1,600 is actually an OK number: provided that the list has really been edited with an eye towards frequency, it would be a good number of core chengyu. Though I doubt that was a concern for the wiktionary editors... I got maybe 2000-3000 by culling everything from CEDICT that is marked "idiom", "proverb", "literary saying" and the like, but had to enter a lot of quite well-known chengyu later.
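
A filter along the lines described might look like the Python 3 sketch below. It assumes the CC-CEDICT line layout (`Traditional Simplified [pinyin] /gloss/.../`); the sample lines are written for illustration rather than copied from the dictionary file, and `idiomatic_entries` is a name made up for this sketch.

```python
import re

# Assumed line layout: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
LINE_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")
TAGS = ("idiom", "proverb", "literary saying")

def idiomatic_entries(lines):
    """Yield (traditional, simplified) for entries whose glosses mention one of TAGS."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        match = LINE_RE.match(line)
        if match and any(tag in match.group(4) for tag in TAGS):
            yield match.group(1), match.group(2)

# Illustrative lines only, not verbatim dictionary entries.
sample = [
    "# comment line",
    "愛不釋手 爱不释手 [ai4 bu4 shi4 shou3] /to love sth too much to part with it (idiom)/",
    "班門弄斧 班门弄斧 [ban1 men2 nong4 fu3] /to display one's slight skill before an expert (idiom)/",
    "鼻子 鼻子 [bi2 zi5] /nose/",
]

found = list(idiomatic_entries(sample))
print(found)
```

A plain substring match like this will miss well-known chengyu that carry no such tag, which fits the experience of having to add many entries by hand afterwards.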

Link to comment
Share on other sites

Very interesting, thanks heaps for your hard work! It makes sense that chengyu like 随时随地, 无论如何, 不可思议, etc would be so prominent as they are very common. It would be great to get some input from native speakers though... :clap

chrix: Yeah, the list from wiktionary is very random, as most people just add whatever entries at the time that interest them. But still, it's a good starting point I think. :)

Link to comment
Share on other sites

yeah, also tooironic, the next exciting question would be, what are the best strategies translators use when faced with chengyu? A lot of chengyu just translate into simplex words in English, some need circumlocutions, some have "English chengyu" counterparts, and in some cases you could use the "the Chinese have a saying" spiel.

Link to comment
Share on other sites

The code is ugly but easy.

I'll post it here and attach it for easier playing. Comments and criticism are welcome.

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 script
import json
import urllib
import os, sys
import codecs
from operator import itemgetter
from xgoogle.search import GoogleSearch, SearchError

# scrape from google.com
# (the caller passes the chengyu already wrapped in quotation marks)
def googleWebSearch(currentChengyu):
    try:
        chengyuCount = GoogleSearch(currentChengyu).num_results
    except SearchError, e:
        print "Google search failed: %s" % e
        sys.exit(1)
    return chengyuCount

# use google web api
def googleAPIsearch(currentChengyu):
    # both parameters go into one dict; a second positional argument
    # to urlencode is not a set of extra parameters
    query = urllib.urlencode({'q': currentChengyu, 'hl': 'cn'})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    est = data['cursor']['estimatedResultCount']
    return int(est)

#START

if len(sys.argv) < 3:
    print "USAGE: chengyu.py listWithChengYus.txt fileNameForSortedChengYus.txt"
    sys.exit(1)

# read chengyu list
chengyulist = codecs.open(sys.argv[1], "r", encoding='utf-8')  # might have to adjust encoding

chengyu_results = {}
i = 1

# get results
for chengyu in chengyulist.readlines():
    chengyu = chengyu.strip().encode("utf-8")
    if len(chengyu) < 1:  # skip blank lines before adding the quotes
        continue
    chengyu = '"' + chengyu + '"'
    chengyu_results[chengyu] = googleWebSearch(chengyu)
    print i, chengyu, chengyu_results[chengyu]
    i = i + 1

# sort list by frequency
chengyu_sorted = sorted(chengyu_results.items(), key=itemgetter(1), reverse=True)

# write sorted list
output = open(sys.argv[2], "w")
for chengyu in chengyu_sorted:
    output.write(str(chengyu[0]) + "   " + str(chengyu[1]) + "\n")
output.close()

EDIT: whoops, embarrassing. I forgot the quotation marks again for the Google websearch! I fixed it in the quote, but not in the uploaded file. Please note this when playing with the code!
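
For anyone playing with the script on a current interpreter, here is a Python 3 sketch of the same read, count, sort, write flow with the lookup passed in as a function, so the logic can be tried without hitting a search engine. `fake_count` is a stand-in that returns numbers from the quoted-search list earlier in the thread, not a real search call; `rank_phrases` and `write_ranking` are names made up for this sketch.

```python
import io
from operator import itemgetter

def rank_phrases(lines, count_fn):
    """Count each non-empty line as a quoted phrase and sort by count, highest first."""
    results = {}
    for line in lines:
        phrase = line.strip()
        if not phrase:  # skip blank lines before adding quotes
            continue
        results[phrase] = count_fn('"%s"' % phrase)
    return sorted(results.items(), key=itemgetter(1), reverse=True)

def write_ranking(ranking, out):
    """Write one 'phrase   count' row per line."""
    for phrase, hits in ranking:
        out.write("%s   %d\n" % (phrase, hits))

# Stand-in for a real lookup.
FAKE_HITS = {'"班门弄斧"': 73100, '"爱不释手"': 654000, '"杯弓蛇影"': 29300}

def fake_count(query):
    return FAKE_HITS[query]

ranking = rank_phrases(["班门弄斧\n", "\n", "爱不释手\n", "杯弓蛇影\n"], fake_count)
buffer = io.StringIO()
write_ranking(ranking, buffer)
print(buffer.getvalue(), end="")
```

Keeping the quotes out of the dictionary keys and adding them only when the query is built also keeps the output file free of quotation marks.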

Link to comment
Share on other sites

the next exciting question would be, what are the best strategies translators use when faced with chengyu?

To be honest, I think that question is out of the scope of this forum, unless they create a sub-forum dedicated to translation studies and professional translators were around to answer such questions. I've tried to request one, but it seems the demand is not there. Still, you could create a new topic and see the answers you get.

Link to comment
Share on other sites

well, one question does not an entire subforum make :mrgreen: Is there some literature available on this? I'd like to learn more about it, is all....

roddy was discussing the future of the forums a day or two ago, maybe you can reiterate your request. I'd be happy to second you....

Link to comment
Share on other sites

here are the 'first 100' google.com results, this time WITH quotation marks. Now it looks a bit uglier, but that could be easily fixed if needed.

"鼻子不是鼻子脸不是脸"   223000000
"鼻子不是鼻子臉不是臉"   119000000
"阿狗阿猫"   76300000
"阿狗阿貓"   75900000
"百廢俱興"   12700000
"百废俱兴"   11600000
"包在...身上"   9960000
"比手划脚"   8530000
"百無一二"   7260000
"百无一二"   7250000
"八千里路雲和月"   7200000
"边都沾不上"   6710000
"愛不釋手"   6590000
"爱不释手"   6580000
"八九不离十"   5540000
"彼一時,此一時"   4250000
"邊都沾不上"   3730000
"爱莫能助"   3460000
"彪腹狼腰"   3050000
"按勞分配"   3040000
"表里一致"   2920000
"表裡一致"   2900000
"按兵不动"   2510000
"爱面子"   2430000
"鼻子气歪了"   2370000
"变化莫测"   2230000
"變化莫測"   2220000
"闭门造车"   1830000
"閉門造車"   1830000
"別具一格"   1760000
"阿猫阿狗"   1730000
"拜倒石榴裙下"   1520000
"班門弄斧"   1510000
"鼻子氣歪了"   1380000
"按劳分配"   1100000
"比手劃腳"   985000
"標新立異"   935000
"标新立异"   935000
"八千里路云和月"   917000
"百感交集"   897000
"蹦蹦跳跳"   895000
"八九不離十"   761000
"愛面子"   729000
"阿貓阿狗"   720000
"按兵不動"   710000
"办事不牢"   698000
"辦事不牢"   697000
"抱頭鼠竄"   690000
"杯水車薪"   666000
"杯水车薪"   664000
"背水一战"   624000
"暴跳如雷"   622000
"背水一戰"   619000
"彼一时,此一时"   578000
"杯弓蛇影"   506000
"抱佛腳"   486000
"抱佛脚"   485000
"扮豬吃老虎"   456000
"豹头环眼"   436000
"白刀子進,紅刀子出"   425000
"愛莫能助"   420000
"白刀子进,红刀子出"   414000
"安分守己"   397000
"半斤八两"   393000
"半斤八兩"   391000
"班门弄斧"   382000
"暗暗自責"   350000
"報仇雪恨"   348000
"半推半就"   346000
"暗暗自责"   333000
"杯酒釋兵權"   329000
"报仇雪恨"   326000
"白駒過隙"   311000
"白驹过隙"   310000
"扮猪吃老虎"   231000
"百思不解"   224000
"抱头鼠窜"   222000
"八仙过海,各显神通"   213000
"八仙過海,各顯神通"   212000
"百口莫辩"   184000
"百口莫辯"   183000
"安之若素"   177000
"悲欢岁月"   166000
"悲歡歲月"   166000
"八竿子打不著"   141000
"豹頭環眼"   126000
"綁赴市曹"   105000
"八竿子打不着"   92100
"杯酒释兵权"   85000
"杯盤狼藉"   84900
"杯盘狼藉"   84700
"碧眼童颜"   72300
"碧眼童顏"   72300
"绑赴市曹"   68500
"豹頭猿臂"   68200
"豹头猿臂"   67300
"安家立業"   51300
"報讎雪恨"   49800
"安家立业"   48400
"报雠雪恨"   46000

EDIT: turns out, those numbers are not what google actually says. To get what google actually says, one MUST NOT add the "". *sigh*. Can't anybody settle on a standard? :(

Link to comment
Share on other sites

well, one question does not an entire subforum make. Is there some literature available on this? I'd like to learn more about it, is all....

Hehe, and neither does just a handful of interested people. Honestly, the more I think about it, the more I realise that an entire subforum wouldn't really be appropriate here. From what I can tell, there are only about half a dozen professional translators who are active posters here, and, at any rate, that does not guarantee that they would be willing to discuss anything beyond language issues, which are adequately covered in the other forums. But I'd be happy to discuss it further with you via email or MSN. I'll PM you. :)

Link to comment
Share on other sites

tooironic, well I know there's the fanyi mailing-list, but if you have half a dozen translators on this forum, that's already a bit more than those interested in Classical Chinese. But we have been quite active and it seems that roddy is gonna give us a subforum :clap . So it's not so much about the number of people but more about constant activity over a period of time...

Link to comment
Share on other sites
