
Chinese Word Segmenter


xilus2


The following is my personal opinion:

 

If you don't want to learn to read Chinese, learn pinyin. This is only really recommended for people who don't want/need to interact with the language on a very deep level - perhaps you want to get up to survival level for a brief period of travel.

 

If you want to learn to read Chinese, you're going to have to do it without spaces sooner or later (and it's going to be sooner, unless you're deliberately shielding yourself from contact with authentic materials). Segmenting text correctly in your mind is a skill, and by having software do it for you, you're depriving yourself of the opportunity to develop that skill. If the software also makes lots of mistakes, you're doubly handicapping yourself.


Maomaoit'snotnecessary,butitgreatlyimprovestheefficienceywithwhichonereads. Word separation is there because it makes reading a lot quicker as you only have to decide where the word breaks should be when you're writing. And the writer will generally know what the words are. Occasionally, you'll have some disagreement about when exactly words should be concatenated, hyphenated or left separated, but those tend to happen relatively infrequently. I was reading a book from a hundred years ago and they hyphenated to-morrow.

 

What's more, for beginners it's even more important because they don't necessarily have a large enough vocabulary to know how the words are supposed to be combined.


butitgreatlyimprovestheefficienceywithwhichonereads

Only at a very low level.  At least with Chinese, I find that now that I am used to no spaces, having them in slows me down considerably.  In any event, I agree with the duck that for Chinese, reading without spaces and learning where to break words is a skill you need to develop.

 

Regarding the segmenter, it doesn't work for long texts (it truncated the text when I tried it with a short novel) and it takes a long time to segment.  Both of these things are issues I tried to solve with my own segmenter.


@imron and Demonic_Duck, it applies to all languages to some degree or another.

 

Chinese probably has less of an issue with it because most words are three or fewer characters long, with the most common words being one or two. But it is still a drag on efficiency in those cases where a character could be shared between multiple words: a space separating them would eliminate the ambiguity and somewhat reduce the amount of work necessary when reading.
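The shared-character ambiguity described here can be sketched with a toy dictionary segmenter. This is only an illustration, not how the tool under discussion works: the dictionary `TOY_WORDS`, the sample phrase, and both function names are assumptions made up for the example. Greedy longest-match run left to right and right to left disagrees exactly where a character (生) could attach to either neighbour.

```python
def forward_max_match(text, words, max_len=4):
    """Greedy left-to-right: take the longest dictionary word at each step."""
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or piece in words:
                out.append(piece)
                i += size
                break
    return out


def backward_max_match(text, words, max_len=4):
    """Greedy right-to-left: take the longest dictionary word ending at j."""
    out, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in words:
                out.append(piece)
                j -= size
                break
    return out[::-1]


# Hypothetical toy dictionary; a real segmenter would use a large lexicon.
TOY_WORDS = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"

fwd = forward_max_match(sentence, TOY_WORDS)
bwd = backward_max_match(sentence, TOY_WORDS)
print(" ".join(fwd))
print(" ".join(bwd))
print("ambiguous" if fwd != bwd else "consistent")
```

When the two passes disagree, a reader (or a program) has to use context to choose; a space in the source text would have settled it in advance.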

 

Anyway, how native readers and people who have been reading for a long time perceive it is slightly off topic. The OP is presumably still learning to read, and until one has an idea of which words to expect in collocations, it's rather tough to know where the word boundaries are if they're not visible.


Unfortunately, I strongly suspect your hypothesis (that having no spaces is a "drag in efficiency") is untestable. As far as I know, there aren't any living languages which are sometimes written with spaces and sometimes without, though I could be wrong about this. More to the point, even if such languages exist, have studies been done comparing how quickly/efficiently people who have been taught to read with or without spaces can read text in that language?

 

If you can point to some strong evidence, I'll happily eat my words (with the caveat that if the language studied has an alphabetic script, the same may not necessarily hold true for a syllabic script). Otherwise, your claims are without basis.


I strongly suspect the point is moot anyway, because whatever theoretical increase in speed you might get from adding spaces to Chinese, the reality of the situation is that the overwhelming majority of Chinese content does not contain spaces, and that situation is unlikely to change any time soon.

 

What this means is that if you want to get to the level where you can read native content, then you need to learn to spot word boundaries without spaces.  The best way is not to have things pre-segmented, but to struggle through trying to make sense of non-segmented text (that whole 'trying to make sense' part is where the learning is done).


Of course people should eventually learn to read unsegmented texts, and the sooner they start getting used to them, the better.

But for a beginner, a tool like this might be useful when tackling unfamiliar texts with many unfamiliar words. Written Chinese can be really daunting in the beginning, most of us eventually forget just how daunting it was :)


@hedwards

 

Segmentation for English is not about words without spaces. When I started to learn English, we segmented sentences into phrases, for example "John Smith, /our teacher,/ came in/ with a book in his hand."

The problem I found with the software is that it simply splits text into two- or three-character words, separating words which should stay together. For example:

 

从 试卷 的 得分 情况 可以 看出 , 考生 文言文 的 断句 能力 较差 , 这 实质上 是 缺乏 文言文 的 语感 。

 

得分情况 should be together, and 断句能力 should be together. The sentence should be segmented this way:

 

从试卷的得分情况/ 可以看出 , 考生文言文的断句能力/ 较差 , 这实质上是 /缺乏/文言文的语感 。


I pasted a few paragraphs into the tool and got the following. I have no strong views about people using whatever tools they like to learn a language, but I think the tools should be effective and accurate. Some might still find this tool useful. I don't. Perhaps this is just a difference of opinion.

我 們熱愛 的 小城 , 出現 了 前所未見 的 撕裂 ; 我 們 建立 起 來 的 良好 警民 關係 , 也 因 對峙 的 政治 局面 、 劍拔 弩張 的 狀態 , 造成 武力 衝突 , 引起 流血 事件 , 已 再 難 修復 。

警察們 的 警棍 , 已 不 再 是 用來對 付 強盜 和 罪犯 。 和平 示威 的 學生 和 市民 , 也 成為 警棍 下 的 受害者 。 警棍 , 就是 代表 著 當權者 的 權力 , 就 像 村上春樹 所 說 代表 政府 的 那 堵 高牆 , 像 「 轟炸機 、 戰車 、 火箭 與白磷彈 」 一 樣 , 手無 寸鐵 的 學生 和 市民 「 被 壓碎 、 燒焦 、 射殺 」 。

再看看 一 位 十四 歲 女孩 在 牆 上 畫 花 的 遭遇 。 她 , 只有 一 枝 粉筆 , 默默 的 在 牆上 繪花 , 表達愛 自由 、 爭取 民主 、 反抗 不 義 的 情操 。 卻被 十四 位 警察 圍困 、 威嚇 、 拘捕 和 扣押 , 若非 市民 團結 起來 的 吶喊 和 支援 , 小 女孩 被 折磨 的 苦難還會繼續 下去 。

PS - the tool has failed to show “我們”、“熱愛”、“劍拔弩張”、“對付”、“表達”、“愛”、“手無寸鐵”、“不義”、“苦難”、“繼續” etc as separate words.

Could it be anything to do with simplified/traditional character sets? Perhaps the segmenter has inadequate support for traditional.
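One quick way to probe that hypothesis is to check which characters in a text never appear in the segmenter's word list. The sketch below is purely illustrative: `SIMPLIFIED_ONLY` is a made-up stand-in for the tool's dictionary (which nobody in this thread has seen), and `uncovered_chars` is a hypothetical helper name.

```python
def uncovered_chars(text, words):
    """Return the CJK characters in `text` that occur in no dictionary word."""
    known = {ch for word in words for ch in word}
    return {
        ch
        for ch in text
        # Only look at the basic CJK Unified Ideographs block; skip punctuation.
        if "\u4e00" <= ch <= "\u9fff" and ch not in known
    }


# Hypothetical dictionary containing simplified forms only.
SIMPLIFIED_ONLY = {"我们", "热爱", "的", "小城"}

simplified_gaps = uncovered_chars("我们热爱的小城", SIMPLIFIED_ONLY)
traditional_gaps = uncovered_chars("我們熱愛的小城", SIMPLIFIED_ONLY)

print(simplified_gaps)   # empty: every character is covered
print(traditional_gaps)  # the traditional-only characters
```

If a real dictionary showed a similar gap on traditional input, mis-segmentations around characters like 們, 熱 and 愛 would be unsurprising; if coverage were fine, the cause would lie elsewhere.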

 

To me, the main problem with the site (if I was inclined to want to use that function) is the inordinately long time it takes to work. Imron's software says it can segment a novel in under a second. This site takes at least ten times as long to segment even a short paragraph. I can't imagine that this is accounted for by being online/offline tools, nor by the fact that the output is given as a paragraph with spaces rather than individual words.


@Demonic, it definitely is testable, it's just that I'm not aware of anybody doing so. And unless the Chinese government starts considering the possibility, the only actual benefit would be beginner materials that are more friendly to the beginner. It would be a bit tough to do a proper double blind test, but that doesn't make it untestable.

 

As I've already noted, Chinese is probably less sensitive to the lack of word breaks than a language that uses a more limited set of characters. A large part of the problem in English is that if you have the letters i, t, i and s in sequence, you don't know which of several possibilities it is. It could be "it is", it could be "itis", and it could be a somewhat ungrammatical "I tis." Chinese seems to have fewer possibilities, so it probably does impact efficiency less. That being said, the cost isn't zero, and it's more efficient for the writer to segment the text once, before anyone reads it, whereas segmentation by the reader is repeated every time somebody reads the text.
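The "itis" case can be made concrete by listing every way to split the string into dictionary words. This is a minimal sketch with an assumed toy word list (`TOY_ENGLISH`) and a hypothetical helper name; it says nothing about how often such ambiguity arises in real text.

```python
def all_segmentations(text, words):
    """List every way to split `text` into a sequence of dictionary words."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in words:
            # Keep this prefix and segment whatever is left over.
            for rest in all_segmentations(text[end:], words):
                results.append([head] + rest)
    return results


# Hypothetical toy dictionary, lower-cased for simplicity.
TOY_ENGLISH = {"i", "it", "is", "tis", "itis"}

parses = all_segmentations("itis", TOY_ENGLISH)
for parse in parses:
    print(" ".join(parse))
```

With this word list the function finds three readings, matching the three listed above. The number of candidate splits can grow quickly with string length, which is the work a space saves the reader.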

 

@Maomao, I remember going through that when I started to read German. Segmentation is between words; the stress pattern is something different and somewhat more difficult to pick up. It requires that you not just see the words separately, but consider the next words as you're reading the current word, and consider the intent of the author. Depending on how precisely you modify the stress pattern, you can wind up with a very different sentence. Mandarin, being stress timed rather than syllable timed, has a similar possibility as well.

 

As interesting as this is, I don't think it's terribly productive to argue about as the government is probably not going to recognize the obvious superiority in word breaks, especially for the purpose of boosting literacy for the masses and making it easier for foreigners to learn the language.


When I said "untestable", I meant untestable in fact, rather than untestable in principle.

 

As interesting as this is, I don't think it's terribly productive to argue about as the government is probably not going to recognize the obvious superiority in word breaks, especially for the purpose of boosting literacy for the masses and making it easier for foreigners to learn the language

 

Yes, it really is quite astounding that neither the government of the PRC nor the government of the ROC will change their entire writing system based on what some guy on some forum said, especially when word breaks are so obviously superior that their superiority doesn't require any supporting evidence.


I have the explanation for the delay using the software:

 

Has anyone seen the episode of 爱情公寓 where the character 悠悠 (who doesn't speak Japanese) is able to talk with 关谷's Japanese father because 字幕组 is translating for them real time?

 

Somebody somewhere is putting spaces in all the wrong places :)

