Chinese-Forums

Strategies for indexing Chinese-language material for searching purposes


Jockster


The fact that there are no spaces between words throws a spanner in the works when building search engines for Chinese-language material. I am referring to search engines for a finite set of material here, NOT Internet search engines. Approaches used for European languages don't work, because a computer does not know a priori how to delimit the words in Chinese. You can index each character separately, match the material against a dictionary when you index it, etc. Either way, it is hard.
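To make the two options concrete, here is a minimal sketch of both: splitting text into single characters, and matching against a dictionary while indexing. The dictionary matching shown is forward maximum matching, which is only one of several possible strategies, and the toy dictionary and function names are made up for illustration.

```python
def char_tokens(text):
    """Option 1: treat every non-space character as its own index term."""
    return [ch for ch in text if not ch.isspace()]


def segment_fmm(text, dictionary, max_len=4):
    """Option 2: forward maximum matching against a dictionary.

    At each position, take the longest dictionary entry that matches;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"北京", "大学", "北京大学", "学生"}

print(char_tokens("北京大学生"))
print(segment_fmm("北京大学生", toy_dictionary))
```

With this toy dictionary the greedy matcher yields 北京大学 followed by 生, so a search for 学生 against the segmented index would miss the text even though the characters are there. That kind of ambiguity is a large part of why the dictionary route is hard.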

Does anyone have an idea of how well Google, Alibaba and Baidu perform in this respect? Of course, if you tap a few common search terms into the Chinese version of Google, you get a lot of hits. But (and this is a blessing for the Internet search engines) you cannot tell whether the search engine missed something, i.e. whether some hits are missing. The reason, obviously, is that none of us knows what is out there, which is one of the reasons we use a search engine in the first place.

When you have a finite set of documentation, say 10,000 pages if printed, then you can (relatively) easily check whether the search engine finds 100% of the matches, because you can check the results against the material itself.
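One way to run such a check, sketched under the assumption that the whole corpus fits in memory: do a brute-force substring scan to get the true set of matching documents, then compare it with whatever the search engine returned. The `engine_hits` list below is a stand-in for the output of the engine under test, and the sample documents are invented.

```python
def ground_truth(docs, query):
    """Brute-force scan: every document whose text contains the query string."""
    return {doc_id for doc_id, text in docs.items() if query in text}


def recall(engine_hits, docs, query):
    """Fraction of truly matching documents that the engine returned."""
    expected = ground_truth(docs, query)
    if not expected:
        return 1.0  # nothing to find, so nothing was missed
    return len(expected & set(engine_hits)) / len(expected)


# Invented sample corpus and a pretend engine result, for illustration only.
docs = {1: "我去过故宫", 2: "故宫在北京", 3: "北京大学"}
engine_hits = [1]

print(recall(engine_hits, docs, "故宫"))
```

The scan is far too slow to serve as the search engine itself, but as a test oracle for a finite corpus it shows exactly which documents were missed.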

Any programmers out there who would like to share their insights? :)


Agreed, not having spaces to delimit words is a big hassle for programmers, especially if you want to do something like count word frequencies. I'd guess the search engines don't rely on dictionaries, as a dictionary would be useless for names and for new words, which are missing from dictionaries but are usually exactly the kinds of things people want information about.

At a guess, I'd say it wouldn't be a big problem for search engines though. They'd just treat each character as a separate word and use the same code/approach for finding compound words in Chinese as they do for adjacent words in English. The approach for indexing "Buckingham Palace" (with quotes) should be the same as for 故宫. I don't think Google needs to know if something is a word or a phrase.
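A rough sketch of that idea, assuming a simple in-memory positional index: each character is recorded together with its positions in each document, and a query such as 故宫 is answered like a quoted phrase, by requiring the characters to appear at consecutive positions. This is an illustration of the guess above, not a description of what Google or any other engine actually does, and all names and sample texts are made up.

```python
from collections import defaultdict


def build_index(docs):
    """Map each character to {doc_id: [positions where it occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, ch in enumerate(text):
            index[ch][doc_id].append(pos)
    return index


def phrase_search(index, query):
    """Return ids of documents containing the query characters consecutively."""
    if not query:
        return set()
    hits = set()
    for doc_id, starts in index.get(query[0], {}).items():
        for start in starts:
            # Every following character must sit exactly k places after start.
            if all(
                start + k in index.get(query[k], {}).get(doc_id, ())
                for k in range(1, len(query))
            ):
                hits.add(doc_id)
                break
    return hits


# Invented sample corpus, for illustration only.
docs = {1: "我去过故宫", 2: "故事发生在宫里", 3: "北京大学"}
index = build_index(docs)

print(phrase_search(index, "故宫"))
```

Document 2 contains both 故 and 宫, but not next to each other, so only document 1 matches. No dictionary is involved, which is why names and new words are found as easily as established ones.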
