Chinese-Forums

Stop words list Chinese


chinasnippets

Hi,

I'm working on a project where having a stop word list for characters would come in handy.

I have searched around, but I have only found a few theses that study this; I didn't find any actual lists.

Do any of you know of, or have, a Chinese character stop word list that I could use?

Thanks a lot.


I learned something new today.

http://hi.baidu.com/seosky/blog/item/d18cfa3360fa4744ad4b5fc6.html

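To save storage space and improve search efficiency, search engines automatically ignore certain characters or words when indexing pages or processing search queries. These characters or words are called stop words (停用词).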
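
As a rough illustration of what ignoring stop words looks like at the character level, here is a minimal Python sketch. The `STOP_CHARS` set and the `remove_stop_chars` function are made up for this example; the set is a handful of common function characters, not an authoritative list, so swap in whatever list you end up using.

```python
# Hypothetical stop-character set, for illustration only.
# A real list would be much longer and tuned to the project.
STOP_CHARS = set("的了是在和也就都而及与")


def remove_stop_chars(text, stop_chars=STOP_CHARS):
    """Return text with every character found in stop_chars dropped."""
    return "".join(ch for ch in text if ch not in stop_chars)


if __name__ == "__main__":
    print(remove_stop_chars("我的书在桌子上"))  # prints 我书桌子上
```

Filtering single characters is crude, since a character like 在 can also be part of a longer word, so this is probably only a starting point.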

If you want to do intelligent segmentation or text processing for Chinese text, perhaps you should take a look at Adso. It is a Chinese text segmentation and analysis engine. The following command:

./adso -f [file] -g grammar/alexandre1.txt -g grammar/alexandre2.txt --no-phrases

Takes this as input:

在1 月9日召开的2009年全国卫生工作会议上,卫生部党组书记高强指出,促进经济平稳较快增长、进一步改善民生,是今年党和政府工作的首要任务。全国卫生系统和各级卫生行政部门要充分认识做好医疗卫生工作的重要意义和现实意义,坚定不移地贯彻落实党中央、国务院的部署和要求,加快推进医疗卫生体制改革,切实抓好重点项目建设,为改善民生和实现扩内需、保增长的目标服务。

And returns this as output:

卫生工作会议 卫生部党组书记 高强 经济 增长 民生 今年党 和政 首要任务 卫生系统 各级 卫生 部门 医疗卫生 重要意义 现实意义 党中央 国务院 推进 医疗卫生体制改革 重点项目建设 改善 民生 内需 目标

You can take a look at the two grammar files invoked to get a sense of what is happening. The software is basically making selective decisions about what content to filter based on part of speech, word length, and a few other criteria. You will generally need to customize it, but adding or removing rules is easy.

The database ships with the software and includes POS information, so you can use that to generate your own stop word list easily enough.
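
To give an idea of how a stop word list could be derived from POS information, here is a minimal Python sketch. It does not read the Adso database: the `LEXICON` entries, the tag names, and the `build_stop_list` function are all assumptions made for illustration, and the real database layout and tag set may look quite different.

```python
# Hypothetical POS-tagged entries; the real Adso database layout
# and tag names may differ from what is assumed here.
LEXICON = [
    ("的", "PARTICLE"),
    ("和", "CONJUNCTION"),
    ("在", "PREPOSITION"),
    ("经济", "NOUN"),
    ("增长", "NOUN"),
    ("了", "PARTICLE"),
]

# Assumed set of function-word tags to treat as stop words.
STOP_POS = {"PARTICLE", "CONJUNCTION", "PREPOSITION"}


def build_stop_list(lexicon, stop_pos=STOP_POS, max_len=None):
    """Collect words whose POS tag is in stop_pos.

    If max_len is given, keep only words of at most that many
    characters (max_len=1 gives a character-level list).
    """
    words = {
        word
        for word, pos in lexicon
        if pos in stop_pos and (max_len is None or len(word) <= max_len)
    }
    return sorted(words)


if __name__ == "__main__":
    print(build_stop_list(LEXICON, max_len=1))
```

With the sample entries above, the nouns are kept out and only the single-character function words end up in the list.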


Thanks a lot for the great replies. Very useful.

The list is great, and I'll have to study the adsotrans software (and ask my programmer as well) a bit more to see how it could be applied to the specific project I'm working on.

Thanks again, I have a starting point now.

Cheers,

G.
