
Adso - Traditional Character Support and Conversion Tool BETA


trevelyan


Thanks to everyone who helped check the ambiguous traditional and simplified entries for the Adso project. I'm pleased to announce that we can now offer support for the traditional character set, and also handle conversions between simplified and traditional characters:

http://www.adsotrans.com/new2.html

In case it isn't obvious, Adso is not using character-level conversion tools for what we are doing here. All of the conversion is handled at the word-level.

This has a few useful features. First of all, it means that the system won't ask you to manually correct or verify the conversion of words containing ambiguous characters, such as 头发 or 发展. If we already know the word, we can provide the proper encoding. And if we don't know it, we only need to make a mistake once -- please let me know of any problems if you find them. A second advantage is transparent support across our variety of annotation options. So it's possible to Adsotate a simplified text and have the software churn out complex characters, and so on. People can convert encodings by selecting the "Echo Chinese" option.
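To illustrate the difference between word-level and character-level conversion, here is a minimal Python sketch. It is not Adso's actual code: the tables, the function name `to_traditional`, and the greedy longest-match strategy are all assumptions made for the example. The point is only that 发 converts correctly once the whole word is known.

```python
# Hypothetical word-level table; a real system would load this from its lexicon.
WORD_TABLE = {
    "头发": "頭髮",   # hair
    "发展": "發展",   # to develop
}

# Character-level fallback for unambiguous characters (illustrative sample only).
CHAR_TABLE = {"头": "頭", "展": "展"}

# Simplified characters with more than one traditional form.
AMBIGUOUS = {"发": ("發", "髮")}


def to_traditional(text, max_word_len=4):
    """Convert simplified text using greedy longest-match word lookup."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate word first, then shorter ones.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in WORD_TABLE:
                out.append(WORD_TABLE[chunk])
                i += length
                break
        else:
            ch = text[i]
            if ch in AMBIGUOUS:
                # Unknown word containing an ambiguous character:
                # this sketch flags it instead of guessing.
                out.append("[" + "/".join(AMBIGUOUS[ch]) + "]")
            else:
                out.append(CHAR_TABLE.get(ch, ch))
            i += 1
    return "".join(out)


print(to_traditional("头发发展"))
```

Flagging the unknown case is a choice made for this sketch; a system could equally pick the most common form and rely on users to report the mistake.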

I hope this is useful for people. One small note -- this interface is a bit of a BETA service. The backend system powering it isn't fully integrated with the rest of our system yet. So you can't update the backend database yet, and full webpage processing isn't online. I'm working towards putting all of this in place and switching everything over to the new database, but this is a start.


  • 2 months later...

Trevelyan, I have a question for you. First off, let me say that as a former programmer, I appreciate the kind of effort you've put into this project. If it seems like I'm complaining, please don't take it the wrong way. Nobody would even bother to gripe about a tool if it weren't useful to begin with.

My question is this: why isn't adsotrans in TRADITIONAL characters on the back-end? Since it's often a two (or more) to one mapping to go from simplified to traditional, the current set-up will cause no end of headaches for you. But, since there is only ONE way to simplify any given traditional character, going the other way isn't that hard. If the characters were stored based on traditional form and the simplified form were an additional field of data, you could accept definitions of any traditional character and have them applied to both versions of the character. Wouldn't doing it that way save you quite a few hassles and prevent the horrible butchery of the word parsing that many of us have been seeing?
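A rough Python sketch of the storage layout being proposed, taking the one-way premise above at face value. The `Entry` class, the sample rows, and both helper functions are invented for illustration and say nothing about how the Adso database is actually organised.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    traditional: str  # primary key in this hypothetical layout
    simplified: str   # derived field, stored alongside
    gloss: str


# Illustrative sample rows only.
ENTRIES = [
    Entry("頭髮", "头发", "hair"),
    Entry("發展", "发展", "to develop"),
    Entry("發", "发", "to send out"),
    Entry("髮", "发", "hair"),
]


def index_by_simplified(entries):
    """Build the reverse index: one simplified form may map to several entries."""
    index = defaultdict(list)
    for entry in entries:
        index[entry.simplified].append(entry)
    return index


def ambiguous_simplified(index):
    """Return simplified forms that correspond to more than one traditional form."""
    return {
        simp
        for simp, entries in index.items()
        if len({e.traditional for e in entries}) > 1
    }


index = index_by_simplified(ENTRIES)
print(ambiguous_simplified(index))
```

With this layout a definition is attached once, to the traditional form, and the simplified lookup is just a secondary index over the same rows.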


On the surface of it, the problems you have in the song lyric annotation don't seem to be related to traditional/simplified conversion at all.

There's the resultative complement issue, a few mis-parsings, and a non-error (女主角 has the standard pinyin, nǚzhǔjué; the singer fits the rhyme in the song using a common alternative pronunciation).


Could you add Big5 support? Most text I read is written in Big5. Of course there are plenty of programs to convert to unicode, but if I have to convert, I could just as well convert to GB. Not a big deal, but it would be nice.


Constructive criticism is never taken badly, and hopefully this response won't be read as aggressive so much as just straightforward about the challenges we face:

My question is this: why isn't adsotrans in TRADITIONAL characters on the back-end?

Adso started out in Perl, and I was reading simplified texts at the time. There were technically good reasons to avoid Unicode when switching to C++. As Joel mentions, this doesn't really affect the quality. The only real downside is that additions to the simplified database are not immediately converted to complex.

From the linked post:

Readers who aren’t sure what I’m talking about, compare this page to the original post, where I’ve corrected these problems. If anybody, especially Travelyan, has an idea what went wrong, I’m all ears. I’ve never seen Adsotrans butcher such simple text before.

There are many reasons why annotations/translations can be mangled, very few of which are related to traditional/simplified support. First of all, text which is not properly punctuated, such as the text provided in that post, is simply more difficult to parse. Figurative and poetic writing is harder to handle as well. To put the task in context, consider the Google translation of the same text:

Taste better grades must be high name plates most important pure artisanal Ingalls is currently building given restricted marketing others buy spend their money-ray put cards illegally what I both want to to make others see me the pride of I was leading actress because I know that men's vision bad was why it would more attention to facial because I know that no money, everything will Office everybody only say that have money looking very well build wonderful protruding after kiu Korea buy account to lead the tide lay stress on style to France run wish to lead the fashion pursuit of the New Tide would have to spend a banknotes to make others see me the pride of I was leading actress because I know this world pell-mell was why it would more attention to facial because I know have money, everything will Office everybody only say that have money looking very well listen to Who in wrong listen to the Who in the side singing (I is I) listen to him say there is no wrong love wonderful sincerely find there is money looking at moral degeneration is getting worse day things are unpredictable money will not old because I know that money everything will run we will say that the rich rather because I know that money everything will run we will say that the rich rather money everything will run we will say that the rich rather because I know that money everything will run we will say that the rich rather rich really pretty good money money gave me money I told you go

The most important source of errors in annotation/translation is the limitations of our backend database. Beyond CEDICT, there are no good resources freely available from the linguistics or MT community, so we are having to generate information on parts of speech for words ourselves. Missing entries or misclassified words can mislead the software. The system is best at the kinds of texts people actually read with it, as these are the texts where mistakes get corrected quickly. If people wish the software to be better at reading poems or songs, they should use it to read poems or songs and get involved in actively correcting its mistakes.

To respond to other specific points: 買不到 and 買得到 were automatically identified as verb constructs since they had not been explicitly added to the database. I have just done this ("to be unable to buy", "to be able to buy"); so they should annotate correctly in the future. Although Adso currently unifies verbs like this it does not yet alter the basic definition to avoid making more problematic mistakes. Suggestions on how to deal with the entry 要好 are welcome. Does deleting the entry make sense? 錢都給 is not a name, and knowing that it is being identified as such is useful for tweaking the name algorithm: while 钱 is a legitimate last name we obviously don't want names which end with 给.
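As a sketch of the kind of tweak meant here, the following Python fragment rejects name candidates that end in a character unlikely to close a given name. It is not the actual name algorithm: the surname sample, the blocklist (only 给 comes from this post; 的 and 了 are added as assumptions), and the 2-3 character length limit are all illustrative.

```python
# Tiny illustrative surname sample; a real list would be much longer.
SURNAMES = {"钱", "王", "李"}

# Characters assumed unlikely to end a given name. Only 给 comes from the
# discussion above; the others are added for illustration.
BAD_FINAL_CHARS = {"给", "的", "了"}


def is_plausible_name(candidate):
    """Return True if candidate looks like surname + given name."""
    if not 2 <= len(candidate) <= 3:
        return False
    if candidate[0] not in SURNAMES:
        return False
    # Reject candidates such as 钱都给 that end in a function word.
    return candidate[-1] not in BAD_FINAL_CHARS


print(is_plausible_name("钱都给"))
print(is_plausible_name("钱学森"))
```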

Generally speaking, problems are dealt with as they are pointed out. People are welcome to email me text which gets butchered particularly horribly and I'll do my best to tweak the system to improve things as I have the time. Or make a post on NewsinChinese or something. Few people do this, and that is not a criticism: people are busy. And there's not much I can say otherwise. Machine translation and annotation are not trivial issues. I am trying to put in place a system and resources that will make it easier for people to study Chinese and create resources for other projects -- the speed at which everything will improve is basically dependent on how many other people get involved.

Could you add Big5 support? Most text I read is written in Big5. Of course there are plenty of programs to convert to unicode, but if I have to convert, I could just as well convert to GB. Not a big deal, but it would be nice.

This is going to require writing the code to segment Big5 text and adding an extra index field to the database. It is on the agenda, but is not an immediate concern because I simply don't have any time for big projects right now. That being said, you should be able to copy Big5 text right into the window on the advanced page right now -- it will automatically be converted to UTF8 by the browser.
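For anyone who would rather convert a Big5 file themselves before pasting, the re-encoding step is small. A minimal Python sketch using the standard library's `big5` codec; the function name is made up for the example, and this is not part of Adso.

```python
def big5_to_utf8(raw: bytes) -> bytes:
    """Decode Big5-encoded bytes and re-encode them as UTF-8.

    Raises UnicodeDecodeError if the input is not valid Big5.
    """
    text = raw.decode("big5")
    return text.encode("utf-8")


# Round-trip a short traditional-character phrase.
sample = "買得到".encode("big5")
print(big5_to_utf8(sample).decode("utf-8"))
```

Note that this only changes the byte encoding; segmenting the text and indexing it in the database, as described above, is a separate problem.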


That being said, you should be able to copy Big5 text right into the window on the advanced page right now -- it will automatically be converted to UTF8 by the browser.

Thanks. The problem I was having was not because of the traditional characters. If I paste just a line of text, it works fine. If I copy a whole email into the window, nothing happens when I press the adsotate button. Maybe it's too much text or there is some problem with the format.


Maybe I wasn't clear enough when I explained what happened on my page. After running the song through adsotrans and getting all of those errors, I tried converting it to simplified characters and doing it again. Many of the errors did NOT happen then. For example, 買不到 and 買得到 were both correctly Adsotated in simplified characters, and both were cropped down to 買 when run through Adso as traditional characters. I found the actual cropping of characters to be the most disturbing error by far.

There are errors cropping up due to using traditional characters. I know songs are hard, but adso did better with the exact same song in simplified. Is there something I'm missing here?

(女主角 has the standard pinyin, nǚzhǔjué; the singer fits the rhyme in the song using a common alternative pronunciation).

Hmm... I had a hard time believing this, so I looked into it. In Taiwan, the standard pinyin for 女主角 is nǚzhǔjiǎo. It appears the unfortunate truth is that once again, Taiwan and the mainland have different standards. The same is true of the tone on 法國. This sort of problem could be solved by holding different pronunciations for traditional and simplified characters, but it's probably not worth the work.


If you don't understand what I was talking about in my last post, go to adsotrans.com and Adsotate "買得到". Make sure encoding in and encoding out are both set to UTF8 (trad).

Adso will return the sole character 買, but it will have pop-up pinyin for all three. Here is the page source returned:

Now use Adso on the exact same phrase, but use UTF8 (simp) for encoding in and out. That works (except for the pinyin selected for 得). It certainly doesn't munch off any characters, like it does if you use traditional characters.

买得到


weixiaoma,

After running the song through adsotrans and getting all of those errors, I tried converting it to simplified characters and doing it again. Many of the errors did NOT happen, then.

Ahhh.... this happens when simplified text is output in UTF8 too, which suggests this is a hangover from adding support for multiple encodings. Something is probably making a call to an older function in the verb class.

Thanks for the detailed explanation, and I'll do what I can to fix the problem quickly.


Taiwan and the mainland have different standards. The same is true of the tone on 法國. This sort of problem could be solved by holding different pronunciations for traditional and simplified characters, but it's probably not worth the work.

It might be interesting, sometime, to add additional database information about regional and dialectic pronunciation. Since there's no 1-1 relationship between the script format and any particular region's pronunciation (that is, not all readers of traditional characters pronounce them like people do in Taiwan), what might be better would be another selector - "Adsotate using Chaozhou pronunciation" or something.

But that'd be an immense amount of work, and probably would require more than just a simple tinkering with the database back-end.


It might be interesting, sometime, to add additional database information about regional and dialectic pronunciation. Since there's no 1-1 relationship between the script format and any particular region's pronunciation (that is, not all readers of traditional characters pronounce them like people do in Taiwan), what might be better would be another selector - "Adsotate using Chaozhou pronunciation" or something.

We could add as many different pronunciation fields for any word as are needed. I've been thinking it might make sense to add Cantonese pronunciation next, actually.

That being said, this is a lot of work and will have to wait for someone who is both familiar with alternate dialects and willing to do a significant amount of database editing -- at least in providing specific pinyin for cases where the pronunciation differs from standard mandarin, or can't be automatically generated.
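Roughly, the idea is a default reading plus optional overrides that are filled in only where a variety differs. A minimal Python sketch, using the 女主角 example from earlier in the thread; the `Word` class, the `variants` field, and the region key `"taiwan"` are hypothetical and do not reflect the actual database schema.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    pinyin: str  # default (standard Mandarin) reading
    # Region or dialect -> reading, stored only where it differs from the default.
    variants: dict = field(default_factory=dict)


def reading(word, region=None):
    """Return the reading for a region, falling back to the default."""
    return word.variants.get(region, word.pinyin)


LEAD_ACTRESS = Word("女主角", "nǚzhǔjué", {"taiwan": "nǚzhǔjiǎo"})

print(reading(LEAD_ACTRESS))
print(reading(LEAD_ACTRESS, "taiwan"))
```

Because the selector is independent of the script, a traditional-character text can still be annotated with the default reading, and a simplified one with a regional reading.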


That's probably true even among the overseas communities as well.

But it's not hard to think of situations where someone has a traditional character text and expects majority-standard Mandarin (or the reverse) - give a Taiwan resident a simplified text, and he'll still read it in Guoyu standard pronunciation. Besides, several current applications of Adso are oriented at language learners, and a certain percentage of Chinese instruction outside the PRC takes place in traditional characters despite using PRC-official pronunciation.

