Chinese-Forums

Chinese Character Syntax For RegExps?


adamlau


Wow, regexp with Chinese characters. There's something I wouldn't want to have to try.

Are all your strings exactly in that format? Could you just

a) add a ] to the end of each string and then

b) add a [ after the space between the trad / simple characters?

Roddy


Roddy's plan is best.

It's not a Chinese-language issue so much as a question of which programming language you're using and which encoding your Chinese characters are in. PHP is a mess at handling Unicode and still has only limited, experimental support for non-ASCII text in its functions. The reason is that once you shift to Unicode you get a lot of variable-length characters, so the fundamental parsing engine needs to be overhauled.

Last I checked, Perl does regexps decently on GB2312 (fixed-length), but has trouble with Unicode. There are some new libraries that might help, though. My advice: find a language that lets you run regexps on Unicode, and convert all your content to that encoding before making any changes. I know IBM has a dedicated C++ library that will run regexps on Unicode strings, but that may be overkill.
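To illustrate the "convert first, then regexp" advice, here is a minimal sketch in Python 3. Python is my own choice for the example, not something anyone in the thread used, and the sample word is only illustrative. It shows why the encoding matters: on raw UTF-8 bytes a pattern sees three bytes per Chinese character, while on decoded text the same pattern sees one character at a time.

```python
import re

# Illustrative sample only: a four-character word stored as UTF-8 bytes.
raw = "爱沙尼亚".encode("utf-8")

# On raw bytes, "." matches a single byte, so each character is hit three times.
byte_hits = re.findall(rb".", raw, flags=re.DOTALL)

# Decode to a Unicode string first; now "." matches one whole character.
text = raw.decode("utf-8")
char_hits = re.findall(r".", text)

print(len(byte_hits), len(char_hits))  # 12 4
```

The same idea should carry over to any language whose regexp engine works on decoded Unicode strings rather than on bytes.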


a) add a ] to the end of each string and then

b) add a [ after the space between the trad / simple characters?

The problem is that there are 29,079 entries in the latest CEDICT UTF-8 database. I would rather use a regexp...

s/^(\S+\s)(\S+)/$1\[$2\]/;

Now how would I include this in a replace command?

Replace:

s/^(\S+\s)(\S+)

With This:

$1[$2]

Is that correct?
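For what it's worth, here is a hedged sketch of the same substitution in Python 3 (my choice of language, not the poster's tool). It assumes the intended pattern is `^(\S+\s)(\S+)`, i.e. a run of non-whitespace, one whitespace character, then another run of non-whitespace, and that the goal is to wrap the second word in brackets. The function name is made up for the example.

```python
import re

# Python spelling of the Perl-style substitution s/^(\S+\s)(\S+)/$1\[$2\]/
# \S+ is a run of non-whitespace; \s is a single whitespace character.
PATTERN = re.compile(r"^(\S+\s)(\S+)")

def bracket_second_word(line: str) -> str:
    """Wrap the second whitespace-delimited word of a line in [ ]."""
    return PATTERN.sub(r"\1[\2]", line, count=1)

sample = "爱沙尼亚 愛沙尼亞 ai4 sha1 ni2 ya4 Estonia"
print(bracket_second_word(sample))
# 爱沙尼亚 [愛沙尼亞] ai4 sha1 ni2 ya4 Estonia
```

In a search-and-replace dialog, the `s/` prefix and the slashes normally do not go into the search box, since they belong to the Perl/sed operator syntax; only the pattern `^(\S+\s)(\S+)` would be entered there. Whether the replacement field takes `$1` or `\1` depends on the tool.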


  • 4 weeks later...

I still have not figured out how to transform:

爱沙尼亚 愛沙尼亞 ai4 sha1 ni2 ya4 Estonia

(simp word)(trad word)(pinyin)(definition)

to:

爱沙尼亚 [愛--亞] ai4 sha1 ni2 ya4 Estonia

(simp word)([trad word with - replacements])(pinyin)(definition)

Can someone give me a nice regexp to use? The above examples were great, but I could not apply them successfully...
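One possible approach, sketched in Python 3 (my choice, not something from the thread): the "-" replacement needs a character-by-character comparison of the two words, which I don't see how to express in a single plain substitution, so a small helper does that part. The sketch assumes every line has the layout of the sample above, with the simplified word first, the traditional word second, and single spaces between fields. The function names are made up for the example.

```python
def dash_shared(simp: str, trad: str) -> str:
    """Replace each traditional character that equals the simplified one with '-'."""
    if len(simp) != len(trad):
        # Assumption: leave pairs of different length untouched.
        return trad
    return "".join("-" if s == t else t for s, t in zip(simp, trad))

def transform(line: str) -> str:
    """Turn 'simp trad rest...' into 'simp [trad-with-dashes] rest...'."""
    # Raises ValueError if the line has fewer than three space-separated fields.
    simp, trad, rest = line.split(" ", 2)
    return f"{simp} [{dash_shared(simp, trad)}] {rest}"

print(transform("爱沙尼亚 愛沙尼亞 ai4 sha1 ni2 ya4 Estonia"))
# 爱沙尼亚 [愛--亞] ai4 sha1 ni2 ya4 Estonia
```

Running `transform` over each line of the file (opened with `encoding="utf-8"`) would convert the whole database in one pass.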


Sometimes the easiest approach is to take your Chinese text, convert it to decimal codes (&#number;) in Wenlin (the demo has this function for free), and use regex with the numbers instead. Works well.
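A rough sketch of that round trip in Python 3, standing in for the Wenlin conversion step (I have not checked Wenlin's exact output). It assumes the "decimal codes" are numeric character references of the form `&#number;`. The helper name is made up for the example.

```python
import html
import re

def to_decimal_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal reference like &#29233;."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

encoded = to_decimal_refs("爱沙尼亚 愛沙尼亞")

# The text is now pure ASCII, so even a byte-oriented regex engine can handle it.
bracketed = re.sub(r"^(\S+\s)(\S+)", r"\1[\2]", encoded)

# html.unescape turns the decimal references back into characters.
print(html.unescape(bracketed))
# 爱沙尼亚 [愛沙尼亞]
```

The trade-off is that each character becomes a multi-character token, so patterns meant to match "one character" have to match a whole `&#\d+;` group instead.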

If you want to spend the money, PowerGREP 3 now works with double-byte characters. I highly recommend this program to anyone who uses regex frequently.

J

