
Thoughts on implementing a 簡體字 <---> 繁體字 conversion system


hughitt1


Lately I've been looking into the different ways websites handle real-time Chinese-to-Chinese conversion. From what I've read, there are two different systems:

1. GB <---> Big5

2. Unicode <---> Unicode

The first method appears to be more prevalent and easier to implement. Xinhua and many other news sites use it to do the conversion.

There are many scripts freely available on the web (libiconv, one in PHP for trad-simp, another for simp-traditional) to handle conversion between the two encodings.
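By way of illustration, here is a minimal sketch of the encoding step using PHP's iconv (the input filename is hypothetical). Note that a bare encoding hop does not by itself turn simplified glyphs into traditional ones, which is why those GB/Big5 scripts bundle mapping tables:

```php
<?php
// Sketch only: re-encode a GB2312 string as UTF-8 with iconv.
// The //IGNORE flag drops characters that have no mapping in the
// target encoding instead of aborting the whole conversion.
$gbText  = file_get_contents('input_gb.txt');   // hypothetical input file
$utfText = iconv('GB2312', 'UTF-8//IGNORE', $gbText);

// Note: hopping GB2312 <-> Big5 like this only changes the byte
// encoding; the old GB <-> Big5 scripts additionally applied a
// character mapping table to swap simplified glyphs for traditional.
echo $utfText;
```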

The second method is trickier, and I haven't seen as many sites using it. Wikipedia is one site that uses Unicode exclusively. There are also scripts available for handling Unicode (simp <--> trad) conversions; the Wikipedia conversion system is based on an open-source script, HanConvert. I haven't looked into this a lot yet, but the Wikimedia site gives a pretty good overview of how it can be done. The next step would just be to find an easy-to-implement script that does it.
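The core of this style of converter can be sketched in a few lines of PHP. The mapping array below is a tiny illustrative fragment I made up, not HanConvert's actual table:

```php
<?php
// Tiny fragment of a simplified -> traditional character map; a real
// table (e.g. HanConvert's) runs to thousands of entries.
$s2t = array(
    '汉' => '漢',
    '语' => '語',
    '简' => '簡',
    '体' => '體',
);

// strtr() with an array substitutes the longest matching keys first,
// and works on UTF-8 strings because each key is a full byte sequence.
echo strtr('汉语简体', $s2t);   // prints 漢語簡體
```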

The next question is how to handle the database. For obvious reasons it would not do to have two databases, one for traditional and one for simplified, but then how would you let users interact with a single database using both simplified and traditional? The best way around this, I think, would be to keep a traditional database in the backend: because the correspondence between characters is not one-to-one, it is easier to convert from traditional to simplified than the other way around.

Whenever a user adds a word, it would first be converted to traditional (if it isn't already) and stored in the db. Then when users view any given page, it can load in either simplified or traditional based on their preferences: just put a 簡體字 / 繁體字 button at the top, and whenever the user selects one, set a cookie to save the preference for as long as they are viewing the website.
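A rough sketch of the cookie mechanics in PHP; the cookie name 'charpref' and the helpers loadPage()/convertT2S() are made up for illustration:

```php
<?php
// Remember the reader's script preference for the rest of their visit.
// 'charpref' is a hypothetical cookie name; expiry 0 makes it a session
// cookie that vanishes when the browser closes.
if (isset($_GET['charpref'])) {
    $pref = ($_GET['charpref'] === 'trad') ? 'trad' : 'simp';
    setcookie('charpref', $pref, 0, '/');   // must precede any output
} else {
    $pref = isset($_COOKIE['charpref']) ? $_COOKIE['charpref'] : 'simp';
}

// Pages are stored in traditional form; convert on the way out if needed.
$page = loadPage();                                    // hypothetical db helper
echo ($pref === 'simp') ? convertT2S($page) : $page;   // hypothetical converter
```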

One other consideration is vocabulary differences between Taiwan, the mainland, etc. Many of the conversion scripts also convert word usage; for example, "電腦" vs. "計算機".

I think it would be best, however, to exclude this type of conversion: better to display the page exactly as it originally was, only in the specified script (simplified or traditional).

Any thoughts?

References:

Micro tutorial: what's the difference between simplified & traditional Chinese, and are they separate in Unicode?

FAQ: Using HTTP and meta for language information

Automatic font assignment for CJK text

FAQ: Monolingual vs. multilingual Web sites

Wikimedia - Automatic conversion between simplified and traditional Chinese

The Pitfalls and Complexities of Chinese to Chinese Conversion

Encode::HanConvert - Traditional and Simplified Chinese mappings

Wikipedia - Chinese character encoding

繁体中文转换为简体中文的PHP函数 (a PHP function for converting Traditional Chinese to Simplified Chinese)

简体中文转换为繁体中文的PHP函数 (a PHP function for converting Simplified Chinese to Traditional Chinese)


Keith,

I agree about excluding word usage from conversion tools -- the entire idea seems a bit silly. Converting 電腦 to 計算機 seems tantamount to translating "mate" to "friend" when shifting between British and American English, at least.

That being said, I think focusing on character encodings is a bit of a distraction too. Whatever encoding one chooses, it's almost always possible to do a forced character-by-character conversion by jumping between character encodings. All of the conversion tools I'm aware of do this sort of conversion and sacrifice accuracy for ease. Wenlin at least is aware of potentially ambiguous characters, and asks for manual confirmation of the words containing them.

So for flawless text conversion, a prerequisite seems to be developing software that knows the difference between things like the 发 in 发展 and the one in 头发. This means developing a backend database that stores simplified and complex variants side by side, so that if you know the simplified form of any word you can look up the corresponding complex version.
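Sketched as a schema, under the assumption of a MySQL backend (the table and column names here are invented for the example):

```php
<?php
// Hypothetical schema: simplified and traditional spellings side by side,
// so 头发 -> 頭髮 and 发展 -> 發展 resolve at the word level rather than
// character by character.
$pdo = new PDO('mysql:host=localhost;dbname=adso;charset=utf8', 'user', 'pass');
$pdo->exec("
    CREATE TABLE IF NOT EXISTS words (
        id          INT AUTO_INCREMENT PRIMARY KEY,
        simplified  VARCHAR(64) NOT NULL,
        traditional VARCHAR(64) NOT NULL,
        INDEX (simplified)
    ) DEFAULT CHARSET=utf8
");

// Word-level lookup: a known word converts unambiguously.
$stmt = $pdo->prepare('SELECT traditional FROM words WHERE simplified = ?');
$stmt->execute(array('头发'));
echo $stmt->fetchColumn();   // 頭髮, assuming the row exists
```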

I've actually been hoping to get some of these issues solved by adding support for fantizi to the Adso database. It would be easy to computer-generate the fantizi variants for most of the database entries. They could be stored in whatever encodings people would find useful, and there's no reason each entry in the database couldn't be expanded to include multiple fields storing the binary representations of each word in different encodings (i.e. CHINESE_UNICODE, CHINESE_BIG5, CHINESE_GB2312, etc.). The issue holding things back is that I'm not in a great position to check the resulting output and manually resolve the ambiguous entries.

If we had this sort of data in the database, though, it would be trivial to modify Adso to provide text conversion between the various character sets, and it would probably make the software more useful to those studying classical texts or learning Chinese in Taiwan.

Thoughts?


If you want to have the simplified and traditional variants side-by-side in the backend, how would you want to handle accepting & modifying entries? You would want to have the fields linked together so that when one is modified, the other is also modified accordingly. Similarly, if someone wants to add a new entry, you would want the corresponding entries automatically created as the entry is added (so that if someone creates an entry in simplified for 总统, the traditional form 總統 would be added automatically).

Is this what you had in mind?

If the biggest thing holding you back is cleaning up mis-conversions after the initial addition of 繁體字, I'd be glad to help out with manually checking some. I would say, though, not to worry so much about having a few conversion mistakes: most scripts or software out there will probably do at least an okay job initially, and after that users can just fix mistakes as they come across them.

What I could do to help, though, is find a list or table of the more common one-to-many mappings, and then just throw each character into the database along with a wildcard, %, and verify whatever comes up.
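i.e., something like this against the hypothetical words table from the sketch above:

```php
<?php
// Pull every entry containing a given ambiguous character so a human can
// eyeball the generated traditional forms; 发 maps to both 發 and 髮,
// so any word containing it is a candidate for review.
$pdo  = new PDO('mysql:host=localhost;dbname=adso;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare(
    'SELECT simplified, traditional FROM words WHERE simplified LIKE ?'
);
$stmt->execute(array('%发%'));
foreach ($stmt as $row) {
    echo $row['simplified'], "\t", $row['traditional'], "\n";
}
```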

On another note, have you ever considered implementing encoding detection? There are a few websites out there (mandarintools (Perl 5), mandarintools (Java), one in C) with scripts to detect encoding. This would be helpful for users who don't know what encoding the text they are pasting into Adsotrans is in, as well as for lazy users who don't like to select an encoding from a pulldown. : )

Also, here is one more GB <--> Big5 project I stumbled upon via Google.

Ziling


Encoding detection would be nice. For now the browser automatically converts any simplified text to GB2312 when submitting it to the Adsotrans site, so the only problem is if someone submits complex text, or requests webpage processing while specifying the wrong encoding. Once the database and software can natively support other encodings, it would make sense to add the feature.

If the biggest thing holding you back is cleaning up mis-conversions after the initial addition of 繁體字, I'd be glad to help out with manually checking some.

Actually, that's exactly why I've been delaying. One of these "instant work with no instant gratification" projects....

I'll be travelling home for a week or so next week. While there, I'll regenerate the database, creating space in it for word variants in each of the major encodings: (1) GB2312, (2) UTF-8 (simplified), (3) UTF-8 (complex), and (4) Big5. If there are any other fields/encodings people would find it useful to have in the database, please let me know and I can add them at the same time.
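In schema terms, that might look something like the following; the column names are my guesses, not Adso's actual field names:

```php
<?php
// Hypothetical ALTER statements adding one column per encoding variant.
// BLOB columns store the raw bytes of each encoding directly, which
// sidesteps MySQL's per-column character-set handling.
$pdo = new PDO('mysql:host=localhost;dbname=adso', 'user', 'pass');
foreach (array('chinese_gb2312', 'chinese_utf8_simp',
               'chinese_utf8_trad', 'chinese_big5') as $col) {
    $pdo->exec("ALTER TABLE words ADD COLUMN $col BLOB");
}
```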

If you can provide a list of the characters (in simplified form) that need to be manually disambiguated, I can do a manual dump of all words in the database containing those characters, along with their computer-generated complex forms. Checking them will probably be too much work for one person, but once there is a single file with all possible mistakes, perhaps it can be cut up and parcelled out.

Sound reasonable?


Following the link you gave to Wikipedia's explanation of its conversion system, trevelyan, I skimmed this article: http://www.cjk.org/cjk/c2c/c2cbasis.htm

Our SC↔TC code mapping tables are comprehensive and complete. They are not restricted to the GB 2312-80 and Big Five character sets, but cover all Unicode codepoints. In the case of one-to-many SC-to-TC mappings, the candidates are arranged in order of frequency based on statistics derived from a massive corpus of 170 million characters, as well as on several years of research by our team of TC specialists.

If you could get your hands on this mapping table ... life would be good.

I agree about excluding word usage from conversion tools -- the entire idea seems a bit silly. Converting 電腦 to 計算機 seems tantamount to translating "mate" to "friend" when shifting between British and American English, at least.

They talk about different levels of conversion in the article; conversion based on regional lexical conventions falls under what they call "lexemic":

Level 1 - Code: character-to-character, code-based substitution

Level 2 - Orthographic: word-to-word, character-based conversion

Level 3 - Lexemic: word-to-word, lexicon-based conversion

Level 4 - Contextual: word-to-word, context-based translation

I've always had a preference for storing data encoded as utf8, because it encompasses just about all glyphs in existence in a single extensible standard. A shot in the dark here: if you declare the charset of your html page as utf8, does the browser automagically convert any input to utf8? You can construct your database purely in utf8, the conversion table also in utf8, and still deliver the data in the user's preferred form (SC or TC, which I'll use to mean words in addition to single characters) AND preferred encoding (if they select something other than utf8 as an option, do the utf8 -> [gbk|big5] conversion, which should be as simple as calling the right function). The flow would look like this (a rough code sketch follows the outline):

- User enters some data.

--> Data converted to utf8 if necessary, stored as-is (SC or TC)

- User requests data in simplified form

--> Data rammed through TC->SC mapping converter (maybe something happens, maybe it doesn't - we don't need to tag the original data SC or TC)

- User requests data in traditional form

--> Data rammed through SC->TC mapping converter

--> Display ambiguous conversions to the user along with the original data and frequency information (the relative frequency of each possible conversion)

- User requests data in simplified form, GBK encoded

--> Data rammed through TC->SC mapping converter

--> Converted data undergoes encoding-conversion to GBK (should be easy)
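Here is the outline above sketched in PHP, with convertS2T()/convertT2S() standing in for whichever mapping-table converter gets chosen, and the entries table invented for the example:

```php
<?php
// Store step: normalize input to utf8 and save it exactly as entered,
// without tagging it SC or TC.
function storeEntry(PDO $pdo, $text, $srcEncoding = 'UTF-8') {
    if ($srcEncoding !== 'UTF-8') {
        $text = iconv($srcEncoding, 'UTF-8//IGNORE', $text);
    }
    $stmt = $pdo->prepare('INSERT INTO entries (body) VALUES (?)');
    $stmt->execute(array($text));
}

// Request step: run the stored utf8 text through the script converter,
// then re-encode only if the reader asked for something other than utf8.
function renderEntry($text, $script = 'simp', $encoding = 'UTF-8') {
    $out = ($script === 'simp')
        ? convertT2S($text)    // hypothetical mapping-table converters
        : convertS2T($text);
    if ($encoding !== 'UTF-8') {
        $out = iconv('UTF-8', $encoding . '//IGNORE', $out);
    }
    return $out;
}
```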

Whenever a user adds a word, it would first be converted to traditional (if it isn't already) and stored in the db. Then when users view any given page, it can load in either simplified or traditional based on their preferences.

This doesn't appear to solve anything, since you're only encountering the problem of converting SC to TC earlier, at the initial store step, instead of later when another user makes a request. How will you deal with the ambiguities when storing? When you display the ambiguous conversions to the user, you're not losing any original information, and I think that's important.

If you want to have the simplified and traditional variants side-by-side in the backend, how would you want to handle accepting & modifying entries? You would want to have the fields linked together so that when one is modified, the other is also modified accordingly.

Perhaps it's better not to auto-update TC entries when SC entries are made, or vice versa; or at least only do so when a 1-to-1 relationship exists between SC and TC in the conversion map. It's probably best to be conservative when defining hard links like these automatically: when an SC character maps to more than one TC character, err on the side of leaving the TC entry unmodified instead of risking a totally inappropriate modification.
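That conservative rule is simple to sketch in PHP, assuming a list of the known one-to-many simplified characters is on hand:

```php
<?php
// Only auto-create the traditional twin of a new simplified entry when
// every character in it maps one-to-one. $ambiguous is a stand-in for
// the full list of known one-to-many simplified characters.
function safeToAutoConvert($word, array $ambiguous) {
    foreach ($ambiguous as $char) {
        if (mb_strpos($word, $char, 0, 'UTF-8') !== false) {
            return false;   // contains e.g. 发: queue for manual review
        }
    }
    return true;
}

$ambiguous = array('发', '干', '后', '面');   // fragment, not the full list
var_dump(safeToAutoConvert('总统', $ambiguous));   // true: safe to auto-add 總統
var_dump(safeToAutoConvert('头发', $ambiguous));   // false: needs human eyes
```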

My thoughts for now ...


A shot in the dark here: if you declare the charset of your html page as utf8, does the browser automagically convert any input to utf8?

For at least the major browsers, yes.
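Concretely, in PHP the declaration is a one-liner sent before any output (a meta tag in the page head works as well):

```php
<?php
// Sending the charset in the Content-Type header is what cues browsers
// to submit form data back to the server in utf8 as well.
header('Content-Type: text/html; charset=utf-8');
```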

the candidates are arranged in order of frequency based on statistics derived from a massive corpus of 170 million characters, as well as on several years of research by our team of TC specialists.

Sounds interesting. I think it would be easiest to have a list of ambiguous characters and do manual checking, though. Those familiar with the complex script wouldn't necessarily need to know anything about the English translation to judge whether the computer-generated complex form was accurate.

Perhaps it's better not to auto-update TC entries when SC entries are made, or vice versa

Yeah... this is a tricky issue. If we have that list of characters which map ambiguously, we should be able to automatically generate fantizi/jiantizi for all words that don't contain ambiguous characters, and automatically generate lists of the words which need to be checked.

The big issue is still getting a list of the ambiguous jiantizi and the multiple fantizi to which they map.


(2 weeks later...)

Hey,

Sorry it took me so long to reply -- I've been in the process of moving and re-adjusting to taking non-Chinese classes :shock:

I Googled up a few sites which have lists of the one-to-many and many-to-one mappings between simplified and traditional. I believe they should be sufficient to cover at least the majority of the erroneous conversions that would come up.

1. Wikipedia:简繁一多对应校验表 (Wikipedia: verification table for one-to-many correspondences between simplified and traditional)

2. 單個簡體字對應至少兩個繁體字的現象 (the phenomenon of a single simplified character corresponding to at least two traditional characters)

3. http://www.panix.com/~asl2/software/middleproxy/s2t.html

4. A forum discussion containing a decent-sized list

If anyone knows of any others, or any books which contain a list of the mappings, feel free to post them.

Keith

