
Resegmentation Options



A quick introduction to the new resegmentation features of Adso, as well as a brief primer on how to write CODE that uses them.

Introduction to Resegmentation

The first part of analysing any Chinese text is breaking it into constituent units. Because Chinese lacks spaces to demarcate words, one mistake that Adso can make is identifying units that straddle the proper word boundaries. An example of this would be parsing the phrase 二十三个代表团 ("twenty-three delegations") as the number "20" plus the political phrase "The Three Represents" plus the word "group".
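To see how this misparse can arise, here is a minimal Python sketch of greedy longest-match segmentation, the approach mentioned further down this thread as what Adso uses at present. The toy lexicon and the function name `segment` are invented for illustration and are not Adso's actual data or code.

```python
# Toy lexicon, invented for this example; not Adso's real database.
LEXICON = {"二十", "三", "个", "三个代表", "代表", "代表团", "团"}


def segment(text, lexicon):
    """Greedy longest-match: at each position take the longest lexicon entry."""
    units = []
    pos = 0
    max_len = max(len(word) for word in lexicon)
    while pos < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in lexicon:
                units.append(candidate)
                pos += size
                break
    return units


print(segment("二十三个代表团", LEXICON))
```

With this lexicon the segmenter grabs 三个代表 as soon as it has consumed 二十, leaving 团 stranded, which is the error described above.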

The code we have added to support resegmentation can help avoid this sort of error. It provides a way for the software to correct these errors systematically, and lets users specify the particular situations in which entries should be resegmented.

All systematic rules are written into the source code itself. One that has already been implemented is to resegment Obscure Chinese Political Phrases Starting With A Number (OCPPSWAN) when they are preceded by a number. This ensures that the number is unified and treated as a single unit, so that the following two sentences, for example, will be translated and annotated differently:



I'm planning to extend this sort of resegmentation to ALL phrases beginning with a number in the database upgrade next weekend. If you have any suggestions for other general rules which should be implemented to improve the quality of the text parsing, please feel free to pass them along.
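As a rough illustration of the trigger for the systematic rule described above, the check might look something like the Python sketch below. The phrase list, the numeral set and the function name `should_resegment` are all made up for this example; the real rule lives in Adso's source and may work differently.

```python
# Characters treated as numerals in this sketch (an assumption, not Adso's list).
NUMERALS = set("零一二两三四五六七八九十百千万")

# Hypothetical sample of number-initial political phrases.
NUMBER_PHRASES = {"三个代表", "三农"}


def is_number(unit):
    """True if every character is a Chinese numeral or a digit."""
    return bool(unit) and all(ch in NUMERALS or ch.isdigit() for ch in unit)


def should_resegment(previous_unit, unit):
    """True when a number-initial phrase directly follows a number."""
    return unit in NUMBER_PHRASES and is_number(previous_unit)


print(should_resegment("二十", "三个代表"))  # phrase follows a number
print(should_resegment("坚持", "三个代表"))  # phrase follows an ordinary word
```

Extending the rule to all phrases beginning with a number would, in this sketch, amount to replacing the fixed phrase list with a test on the first character of the unit.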

Writing Specific Rules as CODE


Resegmentation code is whatever is enclosed in RESEGMENT tags. A unit will be resegmented if any of the CODE tags contained within the RESEGMENT tags tests positive. In the two examples below, the first resegments when preceded by a Country, while the second resegments when 多伦多 is found anywhere in the Chinese text being processed.



The default behavior is for the segmenter to step in one Chinese character and then resegment from there. You can force the resegmenter to break the segmentation at other points by providing the appropriate break point in the first resegmentation tag. To break the segmentation after the third character, for instance, change the outer tags to....
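The effect of the break point can be pictured with the following Python sketch. It is only one reading of the behaviour described above: the lexicon, the function names and the `break_after` parameter are hypothetical, and the real segmenter may treat the characters before the break differently.

```python
# Toy lexicon, invented for this example; not Adso's real database.
LEXICON = {"二十", "三", "个", "三个代表", "代表", "代表团", "团"}


def segment(text, lexicon):
    """Greedy longest-match segmentation (single characters as fallback)."""
    units, pos = [], 0
    max_len = max(len(word) for word in lexicon)
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in lexicon:
                units.append(candidate)
                pos += size
                break
    return units


def resegment(units, index, lexicon, break_after=1):
    """Break units[index] after `break_after` characters, then segment again.

    Assumption: the characters before the break are segmented on their own,
    and everything after the break is segmented afresh.
    """
    head = units[index][:break_after]
    rest = units[index][break_after:] + "".join(units[index + 1:])
    return units[:index] + segment(head, lexicon) + segment(rest, lexicon)


before = ["二十", "三个代表", "团"]
print(resegment(before, 1, LEXICON))
```

With the default of one character, 三 is split off and the remainder is read as 个 plus 代表团, so the numbers 二十 and 三 sit next to each other and can be unified in a later pass.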



* There can only be one RESEGMENT tag in any code field.

* If there are any tests that people would like added that don't exist, let me know.

* Examples of existing CODE markup can be found here.

* There is currently a 9 character limit for the breakpoint in any resegmentation. If you run into the limit, please write.


Forgive the stupid question, but where do we write these code rules? A lot of the segmentation errors - at least, the ones not attributable to vocabulary problems - happen when Adso reads something as a personal name; for example, in this article, in the sentence "出品方特别邀请了韩国重量级电视节目制作人李泰珩任该剧总策划并邀请了韩国大帅哥车仁表出演男主角," 任该剧 is being read as the name 'Ren Gaiju.'

Is there any way of giving specific characters weight in the name-guessing algorithm? That is, if 鑫 and something else come after a known surname, it's probably a (boy's) name, whereas 该 is less likely to be part of one.

Finally, when adding/editing new words, is there a way of saying "this occurs most commonly as a verb, but can also be a noun?" Perhaps multiple 'part-of-speech' boxes? Oh - and really finally this time, when Adso guesses the pinyin of an added entry, it doesn't take into account similar entries with Pinyin specified; that is, "主角" is in there as 'zhu3jue2,' but 女主角, which has an entry but no Pinyin, is 'nv3 zhu3jiao3.' (I've fixed that entry, but I'm sure it's not an isolated case.)


More thoughts on names:

Here is an example of a common misparse:

葛优搭档刘嘉玲 "Ge Youda" in Feng Xiaogang's new Star Wars rip...

If your number resegmenting works out (like in 三农), would it be possible to do something similar for names - i.e. perhaps check if the last character of a potential name forms a compound with the following character?

Sure, Ge You is famous enough to go in the database, but this happens quite frequently.



The name-identifying algorithm currently works by storing a list of characters that should NOT be included in names, and a list of characters that are OFTEN included in names. I've added 该 to the unlikely list. This is done in the source code since I've always thought of it as a short-term solution. (We could create some kind of markup that would be recognized as well.)

When we get frequency data added we should be able to develop a more complex way of parsing names that is sensitive to the frequency of use of individual characters -- working on the assumption that low-frequency characters are more likely to be found in names as individual characters than high frequency ones, etc. In the meantime, if you see a common word that is being bundled into names and should not be, please just make a post about it and I'll add the offending character to our veto list.
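For anyone curious, the two-list approach can be sketched in Python roughly as follows. How the lists are actually combined in the source is not spelled out in this thread, so the scoring here is an assumption, and the sample surnames, list contents and the function name `name_score` are invented for illustration.

```python
SURNAMES = {"任", "李", "葛", "刘"}    # sample surnames only
UNLIKELY_IN_NAMES = {"该"}            # the "veto" list
OFTEN_IN_NAMES = {"鑫", "泰", "嘉"}    # sample entries only


def name_score(candidate):
    """Score a surname + given-name guess; 0 means 'do not treat as a name'."""
    if not 2 <= len(candidate) <= 3 or candidate[0] not in SURNAMES:
        return 0
    given = candidate[1:]
    # Any vetoed character in the given name rules the guess out.
    if any(ch in UNLIKELY_IN_NAMES for ch in given):
        return 0
    # Otherwise start at 1 and add a point per name-typical character.
    return 1 + sum(ch in OFTEN_IN_NAMES for ch in given)


print(name_score("任该剧"))  # vetoed because of 该
print(name_score("李泰珩"))  # plausible name
```

Once frequency data is available, the fixed lists could in principle be replaced by per-character weights, but that is beyond this sketch.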

Markup should go in the CODE field in the database. You have to go to the main editing page and click the box marked "Show Code" to get the field to appear when you're editing.



I'm not sure I understand exactly what you mean. Resegmenting happens before names, numbers and other non-database content are unified, so it doesn't really provide a way to manipulate those sorts of "higher level" constructions. We still depend on the algorithms in the software to figure those out.

We could easily add a rule that recognizes the presence of 档 and uses it to treat names in a special fashion. What would be the logic for how you'd treat 葛优搭档刘嘉玲 that we should put into the source as an algorithm?

Thinking about your post, if it would be useful, it should be possible to develop some sort of markup now that tests whether the last character in any database word could reasonably function as the start of a compound word in the context in which it is found. That might be a useful test for resegmentation:

This may be unnecessary though. One of the changes that was put in place during the shift to support resegmenting was the creation of a separate class to handle text parsing. This gives us a better platform for experimenting with more sophisticated parsing algorithms than the simple "longest-word-match" we use at present....
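As a rough illustration of the last-character test discussed above, applied to a guessed name, here is a small Python sketch. The one-word lexicon, the `max_word_len` parameter and the function name `trim_name` are hypothetical; this is not the logic currently in the source.

```python
# Toy lexicon, invented for this example; not Adso's real database.
LEXICON = {"搭档"}


def trim_name(text, start, end, lexicon, max_word_len=4):
    """Return a new end index for the guessed name text[start:end].

    If the last character of the guess begins a lexicon word together with
    the characters that follow it, drop that character from the name.
    """
    if end - start <= 2:
        return end  # never trim a name below surname + one character
    last = end - 1
    for size in range(2, max_word_len + 1):
        if text[last:last + size] in lexicon:
            return last
    return end


sentence = "葛优搭档刘嘉玲"
new_end = trim_name(sentence, 0, 3, LEXICON)  # initial guess is 葛优搭
print(sentence[:new_end])
```

Here 搭 combines with the following 档 to form 搭档, so the guessed name shrinks from 葛优搭 to 葛优.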


On 女主角....

Trying to dig out these problems for words that already exist in the database should be possible, but it will take a bit of scripting and is probably a good week of work for whoever tackles it. Not really a pressing issue but not unsolvable either.

The way we're working things now is not to generate pinyin automatically when words are added to the database, and to have the software guess it for words that don't have any. So there will be an incentive for people to provide the pinyin manually in cases where the "normal" pinyin and tones are wrong.

Incidentally, you can always tell when the machine is generating pinyin because the pinyin is spaced evenly. When the database contains information on the pinyin of a word, the pinyin is usually spaced the way it was entered into the database.
