Chinese-Forums
Frequently used chengyu project


chrix

I already pretty much got everything from CEDICT and HANDEDICT. That's how I started my list. CEDICT, however, lacks a surprising number of common chengyu, though they do have a lot of them too, to be fair.

By Wikipedia I suppose you mean Wiktionary? Tooironic provided a list of the 1,500+ entries. Do you think you could provide an Excel table reflecting all the information in Wiktionary? Otherwise it might be hard to export.

漢語辭典: where can I get a machine-readable version of this? I would be most grateful to know...

The best sources for chengyu are, in my opinion:

dict.idioms.moe.edu.tw (no English, however)

www.chinese-tools.com seems to have a list that is the same source as www.zdic.net/cy just with some English translations added. I suppose the person running the site got the Chinese data and English data from somewhere else, because there are glitches in some entries indicating that the information was not entered manually...

chrix, then I guess I've got a present for you:

http://stardict.sourceforge.net/Dictionaries_zh_CN.php

and

http://stardict.sourceforge.net/Dictionaries_zh_TW.php

I don't know which of the listed dictionaries has chengyu (but by having a quick look at the page I spotted, for example, 汉语成语词典), but you seem enthusiastic enough to go through them :)

They all seem to be free to use (GPL, free to use, CC, ..) and are there in the stardict format. I don't know too much about it, but apparently there exist some converters, so it's probably not too hard to work with.

I don't know which of the Wikimedia projects I'd find chengyu in, but you can download all the corpora [if you have enough disk space :) ] and then work with them, so if you think there is valuable information in there, we could find ways.

Do the websites you mentioned (thanks, btw) list any sources?

hey, that's great, it's got plenty of my favourite dictionaries in digitised versions :clap

http://www.zdic.net/cy/ doesn't really tell you about its sources, but it's all Creative Commons at least. So CC means extracting their content automatically is ok?

For chinese-tools, I can't find anything either: http://www.chinese-tools.com/chinese/chengyu/dictionary

I've got some more links on my blog:

http://chinesischblog.wordpress.com/%E6%88%90%E8%AA%9E/ and http://chinesischblog.wordpress.com/%E6%88%90%E8%AA%9E/chengyu-listen/

The dictionaries listed on the right hand sidebar are useful as well, though many seem to just use the sources you just posted above :mrgreen:

I'm feeling a bit stupid, but how do I quote in this forum?

Hrm, anyways:

>http://www.zdic.net/cy/ doesn't really tell you about its sources, but it's all Creative Commons at least. So CC means extracting their content automatically is ok?

I suppose so. But if it's CC, they should also be nice enough to give you their database if you ask nicely. And that would be *so much* easier than scraping it from their website (= pretending to be a human with a web browser).

>So CC means extracting their content automatically is ok?

Extracting contents automatically is ALWAYS ok, the only issue is whether you can redistribute the results (and the original data). This seems to be a no-derivatives license, which implies that you can't distribute the data you extract automatically.

Whether anyone would actually go after you is a different question. If you only use small parts, you'd fall under "fair use" or similar, if you extract translations/explanations en masse, then it's questionable. If it's for personal use, you're fine, regardless of the license.

Another issue is how you do your processing. Whether CC or not, most website admins will not appreciate it if you send 50,000 separate requests to their website within an hour. It's generally better to download the whole dictionary in electronic form and do the processing on your own computer, as you save everyone a whole lot of bandwidth. You can do this with CC-CEDICT. I don't know if you can do it with zdic, but you probably can.
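For anyone who wants to try the offline route with CC-CEDICT, here is a minimal sketch in Python. It relies on the CC-CEDICT line format (`TRADITIONAL SIMPLIFIED [pinyin] /gloss/gloss/`); the sample lines are hand-written for illustration, and the function names and the `cedict_ts.u8` filename mentioned in the comment are assumptions. Note that "four Chinese characters" is only a rough filter: it yields chengyu candidates, not a verified chengyu list.

```python
import re

# CC-CEDICT line format: TRADITIONAL SIMPLIFIED [pin1 yin1] /gloss 1/gloss 2/
ENTRY_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/\s*$")


def parse_cedict_line(line):
    """Return (traditional, simplified, pinyin, glosses), or None for comments and malformed lines."""
    if line.startswith("#"):
        return None
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    trad, simp, pinyin, glosses = match.groups()
    return trad, simp, pinyin, glosses.split("/")


def is_four_hanzi(word):
    """True if word is exactly four characters from the basic CJK Unified Ideographs block."""
    return len(word) == 4 and all("\u4e00" <= ch <= "\u9fff" for ch in word)


def four_character_entries(lines):
    """Yield parsed entries whose simplified headword consists of four Chinese characters."""
    for line in lines:
        entry = parse_cedict_line(line)
        if entry is not None and is_four_hanzi(entry[1]):
            yield entry


# Hand-written sample lines; with a real download you would instead iterate over
# open("cedict_ts.u8", encoding="utf-8") (the filename is an assumption).
SAMPLE = [
    "# CC-CEDICT sample lines (for illustration only)",
    "一石二鳥 一石二鸟 [yi1 shi2 er4 niao3] /to kill two birds with one stone (idiom)/",
    "中國 中国 [Zhong1 guo2] /China/",
]
candidates = list(four_character_entries(SAMPLE))
```

Everything here runs on a local copy of the file, so the dictionary's website only sees a single download.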

>no it's not. For example google EXPLICITLY prohibits doing this

They prohibit using their web interface for this purpose.

Once you have the data (like a dictionary) on your hard disk, you can do whatever you want with it. Even if you don't own the copyright. Copyright governs copying, not use, and any such restriction is questionable legally.

gato, this looks very promising. I'll have to strip out the HTML and then write a script for importing it though, since the entries have a different number of categories. It's similar to the unihan database, for which I wrote a script once so I could import it into Excel, so it shouldn't be too complicated...
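A rough sketch of what such a script could look like in Python, using only the standard library. The field names (`chengyu`, `pinyin`, `source`) and the sample entries are made up for illustration and do not reflect the actual structure of the data discussed here. The idea is to strip the tags from each field and then write a CSV whose columns are the union of all fields seen, leaving cells empty where an entry lacks a category.

```python
import csv
import io
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text content of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text of an HTML fragment with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts).strip()


def entries_to_csv(entries):
    """Turn a list of dicts with differing keys into CSV text with one column per key."""
    fieldnames = []
    for entry in entries:
        for key in entry:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    for entry in entries:
        writer.writerow({key: strip_html(value) for key, value in entry.items()})
    return buf.getvalue()


# Hypothetical entries with different sets of categories.
sample_entries = [
    {"chengyu": "<b>一石二鸟</b>", "pinyin": "yi1 shi2 er4 niao3"},
    {"chengyu": "画蛇添足", "source": "<i>example</i>"},
]
csv_text = entries_to_csv(sample_entries)
```

When saving the result for Excel, writing the file with `encoding="utf-8-sig"` is one common way to get the Chinese characters displayed correctly.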

>They prohibit using their web interface for this purpose.
>
>Once you have the data (like a dictionary) on your hard disk, you can do whatever you want with it. Even if you don't own the copyright. Copyright governs copying, not use, and any such restriction is questionable legally.

Fair enough. I didn't even consider that point, what with not being able to get to the data automatically without going through their web interface and all :)

hey phyrex, I was just wondering if you could take your list of only the four-character entries (四个字), convert it to simplified, and then rerun the Google search? Or I can do it, but then I would have to send you the file, and I don't know that that is worth it. On the other hand, the conversion to simplified will introduce some errors, so it would probably be better if I do it, since I don't want to make you go through and check for those. Let me know how that could work.

Thanks!

>if you could take the list that you have of only the 四个字 and convert it to simplified and then rerun the google search?

I think Google does an internal conversion between simplified and traditional when you do a search in Chinese. The number of hits is very similar whether your keywords are in simplified or traditional.

There would be differences though if you search google.com.cn and google.com.tw (or even google.com.hk) separately.

That's probably due to censorship on google.cn.

Ok, I had thought that in some previous searches I did there was a large difference. But I just checked and the results were exactly the same. Never mind.

Yeah, it's the obvious thing to do if the Google guys know anything about Chinese. 不要小看谷歌哦。呵呵 (Don't underestimate Google, hehe.)

>Fair enough. I didn't even consider that point, what with not being able to get to the data automatically without going through their web interface and all

Yes. Downloading the Google database is not exactly an option :mrgreen:

On the other hand, downloading CC-CEDICT is, and this is what I use for the word lists in the First Episode Project.

Here's an Anki set for 1,424 frequent chengyu. I've created the list manually based on a variety of lists (it includes all HSK chengyu, the Singapore ones, all MOE chengyu with a count above 20, the 200 most common chengyu from the newspaper corpus, and also some 10 other lists). I'm not fully satisfied with it yet, but I believe it's a good point of departure for devising such a list... Tell me, though, if any absolutely essential chengyu are missing :mrgreen:

frequent chengyu.zip
