
Is Chinese text on webpages already segmented?


webmagnets


When I go to an English-language web page and long press on a word, it highlights the entire word. I can understand how the browser or OS knows where the word starts and ends, because English separates words with spaces.

However, with Chinese it still knows where the words are. When I long press on the 大 of 大家, the entire 大家 gets highlighted. How does that work?


Can confirm this is the case, yes; AFAIK all major OSes now include some sort of Chinese word segmentation support, though not every browser / text editor necessarily taps into it.

 

The default approach (used by anybody without the AI chops to do better) is to use ICU's dictionary-based word segmenter, which finds the possible breakdowns using a Chinese word list and then picks the most likely one based on word frequencies. (That's pretty much the same thing we do, though our dictionary's bigger because we're not asking OEMs to devote flash storage to it on a billion devices.)
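
To make the "word list plus frequencies" idea concrete, here is a rough sketch in Python. It is not ICU's actual code: the tiny word list, the counts, the `UNKNOWN_COST` penalty, and the function names `segment` and `word_at` are all made up for illustration. It tries every way of splitting the text into dictionary words and keeps the split with the lowest total cost, where a word's cost is the negative log of its relative frequency.

```python
import math

# Toy word list with invented counts (hypothetical; a real segmenter
# ships a far larger dictionary with measured frequencies).
WORD_COUNTS = {
    "大": 50, "家": 40, "大家": 300, "好": 200, "大家好": 5,
    "我": 500, "们": 20, "我们": 400, "是": 600,
    "学": 30, "生": 30, "学生": 150,
}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in WORD_COUNTS)
UNKNOWN_COST = 20.0  # assumed penalty for a single character not in the list


def word_cost(word):
    """Return the cost of a candidate word, or None if it is not allowed."""
    count = WORD_COUNTS.get(word)
    if count is None:
        # Unknown single characters are allowed (at a high cost) so that
        # every input can still be segmented somehow.
        return UNKNOWN_COST if len(word) == 1 else None
    return -math.log(count / TOTAL)


def segment(text):
    """Split text into the lowest-cost sequence of words."""
    n = len(text)
    best = [0.0] + [math.inf] * n  # best[i] = cheapest cost of text[:i]
    back = [0] * (n + 1)           # back[i] = start of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            cost = word_cost(text[start:end])
            if cost is None:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


def word_at(text, index):
    """Return the segmented word covering the character at index."""
    pos = 0
    for word in segment(text):
        if pos <= index < pos + len(word):
            return word
        pos += len(word)
    raise IndexError("index out of range")
```

With this toy data, `word_at("大家好", 0)` returns `大家`, which is roughly what happens on a long press: the text around the touched character is segmented, and the word covering that character is selected.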
