Frequently used chengyu project

November 23, 2009 at 06:35 PM

How about the idea of creating a public Google Doc for people to contribute translations?

I'm not sure a doc is the correct approach; does Google Docs have any way to avoid race conditions? [That is, say two people want to update the document at about the same time. Both download the document, make their changes, and then upload the new version. Since they both downloaded the same old version, the first person to upload will have their changes overwritten by the second.]
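To make the concern concrete, here is a minimal sketch of that "lost update" scenario and of one common guard against it (a version check on upload). The `SharedDoc` class and its methods are invented for illustration; they are not how Google Docs works internally.

```python
class SharedDoc:
    """Toy stand-in for an upload/download document store (hypothetical)."""

    def __init__(self, text):
        self.text = text
        self.version = 0

    def download(self):
        # Return the content together with the version it was read at.
        return self.text, self.version

    def upload_blind(self, new_text):
        # No check at all: whoever uploads last silently wins.
        self.text = new_text
        self.version += 1

    def upload_checked(self, new_text, based_on_version):
        # Reject the upload if someone else changed the document in between.
        if based_on_version != self.version:
            return False
        self.text = new_text
        self.version += 1
        return True


def demo():
    # Case 1: blind uploads, the first uploader's edit disappears.
    doc = SharedDoc("entry")
    a_text, _ = doc.download()
    b_text, _ = doc.download()
    doc.upload_blind(a_text + " +A")
    doc.upload_blind(b_text + " +B")
    blind_result = doc.text

    # Case 2: version-checked uploads, the stale upload is refused.
    doc = SharedDoc("entry")
    a_text, a_ver = doc.download()
    b_text, b_ver = doc.download()
    first_ok = doc.upload_checked(a_text + " +A", a_ver)
    second_ok = doc.upload_checked(b_text + " +B", b_ver)
    return blind_result, first_ok, second_ok, doc.text
```

In the checked case the second editor has to download again and redo the change on top of the newer version, which is roughly what a wiki's edit-conflict warning asks for.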

It seems that something with better support for multiple editors, e.g. a wiki, would be more appropriate. But that raises the problem of how to convert it into a format one can feed into a flashcard program.

November 23, 2009 at 06:38 PM

@chrix

Also, while setting this up, I noticed some other mistakes:

How do you want to handle changes to our multiple versions? You have the "official" version, I have the one in ZDT format, and renzhe is making one as well. Plus all your other fans with their own copies. Having us all make separate changes is time consuming and will lead to some divergence. But it's also somewhat time consuming to do the conversions. If you can make some changes to your version, we can probably make the conversion easier.

November 23, 2009 at 07:23 PM

@numble: yes, that's right. But I'm not going to enter thousands of chengyu by hand. If I use an XLS file I already have, the maximum file size is 2MB.

@jbradfor: google docs actually works well for simultaneous editing. I've done it for discussing the Zuozhuan....

As far as synchronisation goes, I see several problems here:

1. I already have multiple databases on my hands: a) the original chengyu master database, b) the frequent chengyu database, normal version, c) the frequent chengyu database trimmed down to 140 chars for the Twitter version. It's already a lot of hassle to keep track of these three... I plan to write some scripts to automatically export stuff from the main database, but I haven't gotten around to doing so yet. Another problem is that right now the database is nonrelational and getting more and more inefficient, so I'd need to fix that first.... Right now I couldn't even automatically recreate an updated version of the 1,424 frequent chengyu list...

2. As far as anki goes, I think I can share a deck publicly, so every time I synchronise it, it should be synchronised for everybody sharing it as well. The only thing I need to find out first, though, is whether this synchronisation is unidirectional or bidirectional.

3. There's a general problem with updates I see, at least with anki, maybe it's easier with ZDT: what do you do if you have an updated file, can you import it into anki without losing your SRS information? It would be a shame if you had to start all over again...

November 23, 2009 at 08:05 PM

3. There's a general problem with updates I see, at least with anki, maybe it's easier with ZDT: what do you do if you have an updated file, can you import it into anki without losing your SRS information? It would be a shame if you had to start all over again...

Given my recent correct ratio, it would be a blessing, not a shame....

Seriously, that's a good point as well. In ZDT, when you "backup" a flash card set, it also includes your SRS information. From there, one would probably need to use a script to merge the two, i.e. take the pinyin and defs from one and the SRS info from the other, for a given set of characters. [Different characters would probably need manual intervention, but I'm assuming almost all the changes will be to the definition, and secondarily to the pinyin.] I'll be glad to write the script; it's only a dozen lines in gawk.
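The merge described here could look roughly like the following. This is a sketch in Python rather than the gawk mentioned above, and the field names (`hanzi`, `pinyin`, `definition`, `srs`) are assumptions for illustration, not ZDT's actual backup format.

```python
def merge_decks(updated_entries, backup_entries):
    """Combine fresh pinyin/definitions with existing SRS data.

    Both arguments are lists of dicts with hypothetical keys:
    'hanzi', 'pinyin', 'definition', and (backup only) 'srs'.
    Returns (merged, unmatched); unmatched entries need manual review.
    """
    # Index the old scheduling data by the characters of each card.
    srs_by_hanzi = {e["hanzi"]: e["srs"] for e in backup_entries}

    merged, unmatched = [], []
    for entry in updated_entries:
        if entry["hanzi"] in srs_by_hanzi:
            # Content comes from the updated list, SRS from the backup.
            merged.append({**entry, "srs": srs_by_hanzi[entry["hanzi"]]})
        else:
            # The characters themselves changed or are new.
            unmatched.append(entry)
    return merged, unmatched


backup = [{"hanzi": "一哄而散", "pinyin": "yi1 hong4 er2 san4",
           "definition": "old text", "srs": {"interval_days": 12}}]
updated = [{"hanzi": "一哄而散", "pinyin": "yi1 hong1 er2 san4",
            "definition": "new text"},
           {"hanzi": "邯鄲學步", "pinyin": "placeholder",
            "definition": "placeholder"}]
merged, unmatched = merge_decks(updated, backup)
```

Cards whose characters have no match in the backup end up in `unmatched`, which corresponds to the "manual intervention" case.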

And it seems that google docs handles my concern just fine. What I forgot is that in google docs one edits online as well; it's not a storage place where you upload and download....

November 23, 2009 at 08:08 PM

@numble: yes, that's right. But I'm not going to enter thousands of chengyu by hand. If I use an XLS file I already have, the maximum file size is 2MB.

I took the ZDT version I had and created an XLS version of it (using OpenOffice). It's 380 kB.

LATER EDIT: I gave it a try, here's the link: http://spreadsheets.google.com/ccc?key=0AvT_yijZxyXudC1pZGhKSGdvMktCclg5eEdtRU9KY2c&hl=en Does this work? I really should get back to work....

Edited November 23, 2009 at 08:55 PM by jbradfor

November 23, 2009 at 09:22 PM

Oh yes, but my master file has 8587 entries, and 54 columns altogether, so this is already more than 400,000 cells... (Another problem is what I described in my email to you: when I convert Excel to txt, I can't get it to convert to a tabbed UTF-8 file, and a CSV is out of the question because some fields contain longer texts and thus potentially commas.) Ultimately my aim is to create an online database, but until that happens I'd be glad if we could find a good way to do this without having too much trouble synchronising different versions...

In google docs you can assign viewing and editing privileges, so that shouldn't be a problem.

November 23, 2009 at 09:26 PM

well, if there was some uncomplicated way to sync the master database with the file you created it would be perfect....

EDIT: the way I see it, for the time being, the easiest solution would be for me to write a script that matches the updates from the google docs with my main database...

November 23, 2009 at 10:41 PM

If you can split the master list into something google docs can manage (e.g. remove / combine fields, split into two docs), everything else seems like it can be overcome.

Another problem is what I described in my email to you: when I convert Excel to txt, I can't get it to convert to a tabbed UTF-8 file, and a CSV is out of the question because some fields contain longer texts and thus potentially commas.

Can you do a search-and-replace for commas to something you'll never see (e.g. zzzzz), then export as CSV?

November 23, 2009 at 10:59 PM

If you can split the master list into something google docs can manage (e.g. remove / combine fields, split into two docs), everything else seems like it can be overcome.

well, it's one database. Splitting them like this will make it less manageable. There's another problem with copyright issues. Some of the fields are cut and pasted from copyrighted sources and should probably not be part of a publicly available database.

So I see creating a script that matches the frequent chengyu database as the only way to do it for the time being. I would include the following fields:

- fantizi

- jiantizi

- numbered pinyin

- accented pinyin

- English

- German (because a lot of entries are available from Handedict and my own data entry, one might as well make use of it)

- Chinese: because the English translations are often less than ideal, I find it very important to have an additional explanation in Chinese. Probably would use the source gato posted earlier in this thread for this (yet another script!)

- Period and Source: for me this is essential for learning chengyu, to note where they are from, especially if the chengyu in question is from a classical source

- HSK level: might benefit the HSK learner.

- Example sentence: I've copied some stuff around from the internet, but I'm not sure whether example sentences might actually be copyrighted or not. One could include stuff from google, but pedagogically the best thing would be to come up with nice example sentences for the learner (or you just use the Chinese example plugin in anki).

- remarks: for any remarks any contributor might have. If some entry is questionable etc.

there are other fields (synonyms, frequency data etc) that could be included, but I think the focus of this subproject would be on proofreading the pinyin and improving the quality of the English translations.

What do you think?

Can you do a search-and-replace for commas to something you'll never see (e.g. zzzzz), then export as CSV?

Yes, that's a good way to do it, but I've got a lot of fields and wouldn't want to have to do this every time I update the database (so I wrote a script to create the tabbed text file I sent you).

November 24, 2009 at 03:32 AM

What do you think?

Works for me!

November 25, 2009 at 12:56 PM

Working on it... Already imported the Hanyu Chengyu Cidian data, adding 7,600 additional entries to the master list.

While we're at it, here are some "candidates for exclusion" from the frequent chengyu list:

• 精衛填海

• 女媧補天

• 利令智昏

• 圖窮匕見

• 項莊舞劍

• 背水一戰

• 巾幗英雄

• 董狐之筆

• 斷頭將軍

• 焚書坑儒

• 坦腹東床

• 鱗次櫛比

• 穿針引線

• 酒池肉林

And some candidates for inclusion

• 邯鄲學步

• 百尺竿頭

• 點鐵成金/點石成金

Do you know how to update anki decks while keeping the scheduling information intact? As far as I understand, if you just reimport the text file it will be lost....

Edited November 25, 2009 at 04:47 PM by chrix

November 26, 2009 at 10:15 PM

Just FYI, CSV doesn't always mean it's actually comma separated - it could be pretty much anything. Colons, semi colons, tabs, whatever. Moreover, it is common practice to enclose the actual fields that could contain commas in "", to mark that this long text belongs together. So, all in all, CSV files are absolutely up to the task, and IMHO the best choice, because they can be imported into pretty much everything.
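As a small illustration of both points (quoted fields and alternative delimiters), here is a sketch using Python's standard `csv` module. The helper names and the sample definition text are made up for the example.

```python
import csv
import io


def write_delimited(rows, delimiter=","):
    """Serialise rows to text; the csv module quotes fields only when needed."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def read_delimited(text, delimiter=","):
    """Parse text produced by write_delimited back into lists of strings."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


rows = [["背水一戰", "placeholder definition, with a comma"]]
as_csv = write_delimited(rows)        # the second field gets wrapped in quotes
as_tsv = write_delimited(rows, "\t")  # tab-delimited, so no quoting is needed here
```

Reading either string back with the matching delimiter returns the original rows, embedded comma included. When writing to disk instead of a string, opening the file with `encoding="utf-8"` and `newline=""` keeps the Chinese text and line endings intact.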

As to Anki: Internally that works on some sort of SQL database, so even though there is AFAIK no standard way to import changed stuff, it shouldn't be all that hard to write a script to go through an updated CSV file and overwrite the relevant database fields, without changing the SRS information.

BTW, I believe the Anki shared decks are unidirectional.

November 26, 2009 at 10:41 PM

As to Anki: Internally that works on some sort of SQL database, so even though there is AFAIK no standard way to import changed stuff, it shouldn't be all that hard to write a script to go through an updated CSV file and overwrite the relevant database fields, without changing the SRS information.

But how do you get the CSV file from the Anki deck, and then back again, without changing it?

I've found a quote of the master himself:

It's an sqlite database. There are a myriad of tools available to access the file, though you shouldn't need to edit the file directly under normal circumstances. Your file is not 'corrupt' per-se - the data should be fine. But the 'current model' is pointing to a model which has been deleted. If you want to fix the earlier problem yourself, set 'currentModelId' in the 'decks' table to a valid model id.

But how easy would it be?

November 26, 2009 at 10:49 PM

You don't get a CSV from Anki. You take the CSV from your database, and then use it to directly manipulate the Anki file. Since you only touch the 'contents' fields, it should leave the SRS information unchanged.
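A rough sketch of that idea with Python's `sqlite3` module follows. The `cards` table and its columns are invented for the demonstration and are not Anki's real schema; a real script would have to inspect the deck file first, and should only ever run on a copy of it.

```python
import sqlite3


def apply_updates(conn, updates):
    """Overwrite content columns only; scheduling columns are never named."""
    conn.executemany(
        "UPDATE cards SET pinyin = ?, definition = ? WHERE hanzi = ?",
        [(u["pinyin"], u["definition"], u["hanzi"]) for u in updates],
    )
    conn.commit()


def demo():
    # Hypothetical layout: content columns plus two scheduling columns.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cards (hanzi TEXT PRIMARY KEY, pinyin TEXT, "
        "definition TEXT, interval_days REAL, due TEXT)"
    )
    conn.execute(
        "INSERT INTO cards VALUES (?, ?, ?, ?, ?)",
        ("一哄而散", "yi1 hong4 er2 san4", "old definition", 12.5, "2009-12-01"),
    )
    apply_updates(conn, [{"hanzi": "一哄而散",
                          "pinyin": "yi1 hong1 er2 san4",
                          "definition": "new definition"}])
    row = conn.execute(
        "SELECT pinyin, definition, interval_days, due FROM cards"
    ).fetchone()
    conn.close()
    return row
```

Because the `UPDATE` statement lists only the content columns, the interval and due date in the toy table come back unchanged after the update.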

I don't know how hard that would be, but purely from the logic behind it, it doesn't sound like a very big deal. Nothing to hack out in an hour, but not impossible either.

November 26, 2009 at 11:11 PM

I just tried reimporting stuff: I created a fact "test" in a new deck and imported it into an existing one, then changed the fact "test" in the new deck and reimported it. Anki wouldn't update the existing fact; it just disregarded the changed one, since otherwise it would have created a duplicate.

November 30, 2009 at 03:54 PM

Found an error: on the second entry (一哄而散) the tone on 哄 should be first, not fourth. [BTW, CEDict had this wrong; I submitted a correction to MDBG.]

In general, how do you want to handle corrections? Save for now, and worry about them later? Post here? PM you?

November 30, 2009 at 04:06 PM

I wouldn't mind if you posted them here... I don't know if you saw that or not, but I keep adding additional errors to post 117 in this thread.

Right, a lot of frequently used chengyu are not in CEDICT, and I have entered a few hundred of them into the database myself. How do you contact them about this? It's probably something I would do once the database is completed. I also fixed some spelling mistakes and the like, though I'm not sure I always noted them consistently for further reference.

As far as 一哄而散 goes, 哄 is a 破音字, and the dictionaries don't seem to agree, some have it in the first tone, some in the fourth. All TW sources have the first, the majority of BJ sources seem to have the fourth, so it's one of those cases where the Taiwan standard reflects the older usage in the BJ standard. I've made it a point to note all these tone differences in my database between the TW and BJ standards, but CEDICT doesn't seem to do that.

November 30, 2009 at 04:15 PM

You can contact the maintainers here.

Alternatively, if you use the dictionary at mdbg.net, there is an edit button after each entry, allowing you to submit changes. If something is not in the database, you can submit a definition.

It is usually accepted once it passes a review process.

November 30, 2009 at 04:22 PM

yeah, at some unspecified point in time, I will contact them. The only time I use the online site is when I'm outside, away from dictionaries and computer.

Oh right, a progress report on my part:

- mapped the google data onto the main database.

- mapped the Hanyu Chengyu Cidian back onto the MOE data (so I'll probably soon mail roddy the updated version of that list)

STILL NEEDING TO BE DONE:

- marking the chengyu from the frequent chengyu list in the main database

- write a script to export the data for the "Frequent Chengyu on Google Docs" project

- figure out how to import things into anki while leaving the scheduling info intact

And we should probably start discussing in what form we would do that google docs project.

Should everybody get editing rights, or should we ask for a show of hands here and then give out the password by PM? Probably safer that way. If occasional posters would like to contribute, they could leave comments on this thread, or ask here...

November 30, 2009 at 07:35 PM

As far as 一哄而散 goes, 哄 is a 破音字, and the dictionaries don't seem to agree, some have it in the first tone, some in the fourth. All TW sources have the first, the majority of BJ sources seem to have the fourth, so it's one of those cases where the Taiwan standard reflects the older usage in the BJ standard. I've made it a point to note all these tone differences in my database between the TW and BJ standards, but CEDICT doesn't seem to do that.

For this character, CEDict provides three different tones, with different definitions: http://us1.mdbg.net/chindict/chindict.php?page=worddict&wdrst=1&wdqb=%E5%93%84 . And, to make things more confusing, only the fourth tone version has a different simplified version. [if you think this is wrong and can provide evidence, please submit a correction.] CEDict does try to provide TW vs BJ versions, e.g. see http://us1.mdbg.net/chindict/chindict.php?page=worddict&wdrst=1&wdqb=%E6%9C%9F , but I'm sure it's far from complete.

I actually learned the third tone version first, as in to take care of or amuse a child. From some ChinesePod lesson.

But, I fear I've taken us further off-topic. Back on topic, for weird cases like this, I think it's most useful to just pick one, at least for the "basic" version that most people would find useful. For "advanced" users, you could provide these details.
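One way to keep both the "basic" and the "advanced" view in a single record is sketched below. The `Entry` class, its field names, and the choice of the first-tone reading as the default are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    hanzi: str
    default_pinyin: str  # the single reading shown to most users
    regional_pinyin: dict = field(default_factory=dict)  # e.g. {"TW": ..., "BJ": ...}

    def basic(self):
        # Basic export: characters plus the one chosen reading.
        return f"{self.hanzi}\t{self.default_pinyin}"

    def advanced(self):
        # Advanced export: append every recorded regional reading.
        variants = "; ".join(
            f"{region}: {reading}"
            for region, reading in sorted(self.regional_pinyin.items())
        )
        return f"{self.basic()}\t{variants}" if variants else self.basic()


entry = Entry(
    "一哄而散",
    "yi1 hong1 er2 san4",
    {"TW": "yi1 hong1 er2 san4", "BJ": "yi1 hong4 er2 san4"},
)
```

Entries without any recorded variants export identically in both views, so the extra field costs nothing for the ordinary cases.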


Posters, in thread order: jbradfor, jbradfor, chrix, jbradfor, jbradfor, chrix, chrix, jbradfor, chrix, jbradfor, chrix, phyrex, chrix, phyrex, chrix, jbradfor, chrix, renzhe, chrix, jbradfor
