Chinese Character Syntax For RegExps?

September 12, 2005 at 05:38 AM

I want to rename the following:

阿爾及利亞 阿尔及利亚

...to:

阿爾及利亞 [阿尔及利亚]

...can someone give me a regexp to use?

September 12, 2005 at 10:26 AM

Wow, regexp with Chinese characters. There's something I wouldn't want to have to try.

Are all your strings exactly in that format? Could you just

a) add a ] to the end of each string and then

b) add a [ after the space between the trad / simple characters?

Roddy

September 12, 2005 at 12:22 PM

Roddy's plan is best.

It's not a language issue so much as a question of which programming language you're using and which encoding you're using for your Chinese characters. PHP is a mess at handling Unicode and still has limited and experimental support for non-ASCII functions. The reason is that once you shift to Unicode you get a lot of variable-length characters -- so the fundamental parsing engine needs to be overhauled.

Last I checked, Perl does REGEXP decently on GB2312 (fixed-length), but has trouble with Unicode. There are some new libraries there which might help, though. Advice: find a language that allows you to do regexp on Unicode, and then convert any content to that encoding before doing any of the changes. I know IBM has a dedicated library in C++ that will do REGEXP on Unicode strings, but that may be overkill.
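As a rough illustration of the "convert first, then match" advice, here is a minimal sketch in Python (an assumption, since the thread has not settled on a language). It decodes GB2312 bytes into a Unicode string before running a regex, so each Chinese character counts as one unit. The function name and the sample text are made up for the example.

```python
import re

def count_chars(raw_bytes, encoding="gb2312"):
    """Decode bytes to a Unicode string, then count non-space characters."""
    text = raw_bytes.decode(encoding)
    # On a decoded str, \S matches one whole character, not one byte.
    return len(re.findall(r"\S", text))

# Hypothetical sample: 5 characters, stored as 10 bytes in GB2312.
sample = "阿尔及利亚".encode("gb2312")
print(len(sample), count_chars(sample))
```

Matching on the raw bytes instead would see ten units rather than five, which is the kind of trouble the post describes.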

September 12, 2005 at 04:01 PM

With Perl 5.8 you can treat Chinese characters as one unit, if you are careful how you load them from file. In that case, a simple regex would do:

s/^(\S+\s)(\S+)/$1\[$2\]/;

See http://www.chinesecomputing.com/programming/perl.html for some other possibilities.
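For anyone not using Perl, the same substitution might look like this in Python 3, where `str` patterns are Unicode-aware by default. This is only a sketch: it assumes the two forms are separated by a single whitespace character, as Roddy's reply implies, and the function name is made up.

```python
import re

def bracket_second_word(line):
    """Wrap the second whitespace-delimited word of a line in square brackets."""
    # Group 1: first word plus the separating whitespace. Group 2: second word.
    return re.sub(r"^(\S+\s)(\S+)", r"\1[\2]", line, count=1)

print(bracket_second_word("阿爾及利亞 阿尔及利亚"))
```

Lines that do not contain two words separated by whitespace are returned unchanged.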

September 13, 2005 at 09:47 PM

a) add a ] to the end of each string and then

b) add a [ after the space between the trad / simple characters?

The problem is that there are 29,079 entries in the latest CEDICT UTF-8 database. Would rather use a regexp...

s/^(\S+\s)(\S+)/$1\[$2\]/;

Now how would I include this in a replace command?

Replace:

s/^(\S+\s)(\S+)

With This:

$1[$2]

Is that correct?
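On the search/replace split: the leading `s/` is part of Perl's substitution operator, so a separate search field would normally take only the pattern, `^(\S+\s)(\S+)`. For a whole file, a short script may be easier than an editor dialog. The sketch below uses Python rather than Perl, with made-up function names and caller-supplied file paths, and assumes the file is UTF-8 with one entry per line.

```python
import re

SEARCH = re.compile(r"^(\S+\s)(\S+)")  # the "search" half
REPLACE = r"\1[\2]"                     # the "replace" half (Python's \1 is Perl's $1)

def convert_lines(lines):
    """Apply the substitution to each line; non-matching lines pass through."""
    return [SEARCH.sub(REPLACE, line, count=1) for line in lines]

def convert_file(src_path, dst_path):
    """Read a UTF-8 file line by line and write the converted copy."""
    with open(src_path, encoding="utf-8") as src:
        converted = convert_lines(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(converted)
```

Working line by line keeps `\s` from matching a newline and pulling the next entry into the match.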

October 10, 2005 at 08:51 AM

I still have not figured out how to transform:

爱沙尼亚愛沙尼亞 ai4 sha1 ni2 ya4 Estonia

(trad word)(simp word)(pinyin)(definition)

to:

爱沙尼亚 [愛--亞] ai4 sha1 ni2 ya4 Estonia

(simp word)([trad word with - replacements])(pinyin)(definition)

Can someone give me a nice regexp to use? The above examples were great, but I could not apply them successfully...
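A single regex substitution cannot easily compare two words character by character, so this one probably needs a few lines of code around the regex. Here is a hedged sketch in Python, with a made-up function name. It assumes each line is laid out as the labels describe (traditional word, simplified word, then the rest), that the fields are separated by whitespace, and that both words have the same length.

```python
import re

def simp_first_with_dashes(line):
    """Turn 'trad simp rest' into 'simp [trad-with-dashes] rest'."""
    match = re.match(r"^(\S+)\s+(\S+)\s+(.*)$", line)
    if match is None:
        return line  # line does not fit the assumed layout; leave it alone
    trad, simp, rest = match.groups()
    if len(trad) != len(simp):
        return line  # cannot align the two forms character by character
    # Keep a traditional character only where it differs from the simplified one.
    masked = "".join("-" if t == s else t for t, s in zip(trad, simp))
    return f"{simp} [{masked}] {rest}"

print(simp_first_with_dashes("愛沙尼亞 爱沙尼亚 ai4 sha1 ni2 ya4 Estonia"))
```

Lines that do not match the assumed layout are returned unchanged, so they can be reviewed by hand afterwards.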

October 10, 2005 at 01:03 PM

Sometimes the easiest approach is to take your Chinese text and convert it to decimal codes (&#number;) in Wenlin (the demo has this function for free) and use regex with numbers instead. Works well.
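The same round trip can be sketched without Wenlin. The Python below assumes the "decimal codes" are decimal numeric character references such as `&#38463;`; whether Wenlin emits exactly this form is an assumption, and the function names are made up.

```python
import re

def to_decimal_refs(text):
    """Replace every non-ASCII character with a decimal reference like &#38463;."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

def from_decimal_refs(text):
    """Turn decimal references back into the characters they stand for."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

encoded = to_decimal_refs("阿爾及利亞 阿尔及利亚")
# In the encoded form, an ASCII-only pattern such as (?:&#\d+;)+ matches one Chinese word.
bracketed = re.sub(r"^((?:&#\d+;)+\s)((?:&#\d+;)+)", r"\1[\2]", encoded)
print(from_decimal_refs(bracketed))
```

This keeps the regex itself pure ASCII, which helps when a tool cannot handle Chinese characters in patterns.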

If you want to spend the money, PowerGREP 3 works with double byte characters now. I highly recommend this program to anyone who uses regex on a frequent basis.

J

Posters, in thread order: adamlau, roddy, trevelyan, chinesetools, adamlau, adamlau, Konglong