
Ask for suggestions on a free Chinese reading software


yifeng


Hi everyone! I am a teacher at Zhangzhou Institute of Technology, China. My research focuses on Natural Language Processing. Based on this technology I have developed a piece of software to help users read texts in Chinese. Smart Chinese Reader offers several advantages over traditional learning programs, including:


  • Learn by using
  • Higher precision for Chinese word segmentation
  • Full text translation
  • Chinese-English phrase mapping
  • Color part of speech notation
  • Full text pronunciation
  • Handling traditional and simplified Chinese characters simultaneously
  • No Internet connection required

 

I don't know if there is a demand for this software. Could you please give me some suggestions? Smart Chinese Reader is free. You can download it from http://www.nlptool.com. I look forward to hearing from you. Since my software is still very new, any feedback or constructive criticism is highly appreciated!

Link to comment
Share on other sites

My first thought was 176MB is mighty large, but then you say it comes with a large sentence database for translating and segmenting so I can understand the size. Unfortunately your webhost doesn't seem to be able to handle the load, because it's currently downloading at bytes/second and Firefox is telling me the download won't complete for another 480 days.

 

One other small thing.  The <title> of your website has a spelling mistake: 'Cinese' instead of 'Chinese'.

Link to comment
Share on other sites

Okay, I downloaded this (on the third attempt it took about 15 minutes), and it seems to run. However, if you give it large amounts of text, it crashes. I haven't worked out everything it does, but don't get your hopes up about "Higher precision for Chinese word segmentation".
 

著名

he is a famous poem of assessment of the house ,

 

 

So it couldn't figure out 'poetry critic'. A few others like this.

 

It might be useful, but I'm not quite sure for what yet.

Link to comment
Share on other sites

A huge phrase-based statistical Chinese-English translation database and a 109,614 entry dictionary are shipped with the software and operate locally (offline). That's especially valuable when working in areas with poor or non-existent web connections. If you encounter problems while downloading the installation file, you can try to download it from the following hosting sites:

 

http://www.softpedia.com/get/Others/Home-Education/Smart-Chinese-Reader.shtml

http://download.cnet.com/Smart-Chinese-Reader/3000-2279_4-76178599.html

http://www.winsite.com/Home-Education/Language/Smart-Chinese-Reader/

 

 

The <title> of your website has a spelling mistake: 'Cinese' instead of 'Chinese'.

Thank you very much for correction!

Link to comment
Share on other sites

  • 4 weeks later...

I was able to download from your host at 700kb/s, grabbing it in 3 or so minutes. So maybe the issue is intermittent or geographical for some. So far so good. I think it's a great concept. The biggest problem I found was that it is so slow, and crashes on long sentences. Sentence segmentation needs a little more work, for example it doesn't understand . as a period.

 

When I first fired it up, I copied in a few pages of Chinese text, hit the button and expected it to be marked up in a few seconds. It hung there until it told me the sentence was too long, and it did this for each sentence, hanging between each one until it gave a string length error and crashed to the desktop.

 

The second time, I tried it with only a few sentences to check the accuracy. Some of the example sentences I parsed ended with a '.' and were merged together. After I fixed that, I'd say it was fairly accurate for an automatic POS tagger and translator, especially without the budget of Youdao or Google. Quite an accomplishment.

 

I understand a lot is going on under the hood, but if you could increase the speed to something even remotely close to Chinese Text Analyser, it would be incredible. As it stands, if I have to copy in a few sentences at a time, it's more like a  dictionary than a reader.

 

Also, I'd love for pinyin annotations/mouse over definitions above the characters. It just makes sense. Currently, if you don't know the reading for a character, you would either need to flip back and forth between tabs, or play the horrendous Microsoft TTS. Having a way to query the dictionary separately would also be a cool feature.

 

This could be great, as long as the speed issue is resolved. I would put that as priority 1.  It does give you some quick insight into sentence structures and I think this is a great project. Good job.

 

Also, more of a question, what corpus of example sentences did you use?

Link to comment
Share on other sites

Thank you very much for patiently trying SmartCR and giving feedback. I am sorry that trying the long text wasted your time.

 

The biggest problem I found was that it is so slow, and crashes on long sentences. 

When you press the "Segment" button, SmartCR performs three tasks:

 

(1) word segmentation

(2) recognition of new person names and place names

(3) machine translation

 

Task (1). Unlike the simple maximum match algorithm, our word segmentation is based on statistical natural language processing. When there are several segmentation options for a sentence, it chooses the one with the largest overall probability. Although this needs more calculation time than maximum match, it is still acceptably quick.

 

Task (2) doubles the processing time.

Task (3) takes most of the processing time.
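
The difference described in Task (1), between greedy maximum match and choosing the segmentation with the largest overall probability, can be sketched roughly as follows. This is only an illustration and not SmartCR's actual code: it assumes a simple unigram model, and the `WORD_PROB` table, its values, and the function names are all made up for the example.

```python
import math

# Toy unigram probabilities (hypothetical values, for illustration only).
WORD_PROB = {
    "亚洲": 0.02,
    "内": 0.01,
    "内人": 0.0001,
    "人口": 0.02,
    "口": 0.002,
    "人": 0.02,
}
MAX_WORD_LEN = max(len(w) for w in WORD_PROB)
# Penalty for a single character that is not in the table (assumed value).
UNKNOWN_LOG_PROB = math.log(1e-8)


def max_match(text):
    """Greedy forward maximum match: always take the longest known word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or candidate in WORD_PROB:
                words.append(candidate)
                i += size
                break
    return words


def best_segmentation(text):
    """Dynamic programming: maximize the sum of log word probabilities."""
    n = len(text)
    # best[i] = (best log probability of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word in WORD_PROB:
                score = math.log(WORD_PROB[word])
            elif len(word) == 1:
                score = UNKNOWN_LOG_PROB
            else:
                continue
            total = best[start][0] + score
            if total > best[end][0]:
                best[end] = (total, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

With this toy table, `max_match("亚洲内人口")` greedily grabs 内人 and returns 亚洲 / 内人 / 口, while `best_segmentation("亚洲内人口")` returns 亚洲 / 内 / 人口 because that split has the higher overall probability. The dynamic program looks at every possible word ending at each position, which is why it costs more than a single greedy pass.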

 

In the next version of SmartCR, I will give up task (2).

 

For Task (3), it is hard for me to decide. If I also give it up, SmartCR will run significantly faster. But then one important feature of SmartCR will be lost (see point 3 below). So I have thought of two solutions:

 

Solution A: Add a setting that lets users enable or disable (3) themselves. When reading long text, they can disable machine translation. When they want to study a paragraph in detail, i.e., study sentence structure and every word, they can enable it.

 

Solution B: Do machine translation for one sentence at a time, when the user clicks a translate button following the sentence. (SmartCR 1.7 does machine translation for every sentence in the text once the user clicks the "Segment" button.)
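
A rough sketch of what Solution B could look like: nothing is translated up front, and each sentence is translated the first time it is requested and then cached. The `slow_translate` function is a hypothetical stand-in for the real (expensive) translation step, and the `LazyDocument` class is invented for this example.

```python
def slow_translate(sentence):
    """Hypothetical stand-in for an expensive machine translation call."""
    return f"<translation of {sentence}>"


class LazyDocument:
    """Holds segmented sentences and translates each one only on demand."""

    def __init__(self, sentences, translate=slow_translate):
        self.sentences = list(sentences)
        self._translate = translate
        self._cache = {}   # sentence index -> translation
        self.calls = 0     # how many times the translator actually ran

    def translation(self, index):
        # Translate on the first request only; later clicks reuse the cache.
        if index not in self._cache:
            self.calls += 1
            self._cache[index] = self._translate(self.sentences[index])
        return self._cache[index]
```

Here the "translate" button next to sentence `i` would call `doc.translation(i)`. Opening a long text costs no translation time at all, and clicking the same sentence twice runs the translator only once.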

 

Should I give machine translation up?

 

Sentence segmentation needs a little more work, for example it doesn't understand . as a period.

SmartCR splits text into sentences at Chinese delimiters such as , ! and :.

In the next version of SmartCR, English delimiters such as comma and period will be taken into account.
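
For illustration, a sentence splitter that accepts both fullwidth Chinese and ASCII sentence-ending marks might look like the sketch below. The delimiter set and the rule that a period followed by a digit is a decimal point are assumptions made for this example, not SmartCR's actual rules.

```python
import re

# A sentence is a run of non-delimiter characters (a "." directly followed by
# a digit counts as a decimal point, not a delimiter), plus any sentence-ending
# marks that follow it. The delimiter set here is an assumption.
SENTENCE = re.compile(r"(?:[^。！？!?.]|\.(?=\d))+[。！？!?.]*")


def split_sentences(text):
    """Split text on Chinese (。！？) and English (. ! ?) sentence endings."""
    pieces = (match.group().strip() for match in SENTENCE.finditer(text))
    return [piece for piece in pieces if piece]
```

With this pattern, text whose sentences end in an ASCII "." is no longer merged into one long sentence. Closing quotation marks that follow a delimiter would end up at the start of the next sentence, so a real implementation would need extra handling for those.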

 

Also, I'd love for pinyin annotations/mouse over definitions above the characters. It just makes sense. Currently, if you don't know the reading for a character, you would either need to flip back and forth between tabs, or play the horrendous Microsoft TTS. Having a way to query the dictionary separately would also be a cool feature.

My intent is that SmartCR's users should need to consult a dictionary as little as possible. I hope they can read Chinese in a new way which reinforces their memory.

You can infer the meaning of a word from the machine translation. If not, you can move the mouse over it in the Chinese sentence, and the corresponding phrase in the English sentence is highlighted.

 

"The horrendous Microsoft TTS" is the Microsoft Simplified Chinese voice. You should install the "Microsoft Lili" voice for Windows 7, or the "Microsoft HuiHui" voice for Windows 8. Both voices were recorded by native Chinese speakers and are of high quality. They are valuable for Chinese learning. Even better, they cost nothing extra. Please refer to http://www.nlptool.com/TTS.html. If you have difficulties, please let me know.

Link to comment
Share on other sites

My intent is that SmartCR's users should need to consult a dictionary as little as possible. I hope they can read Chinese in a new way which reinforces their memory.

You can infer the meaning of a word from the machine translation. If not, you can move the mouse over it in the Chinese sentence, and the corresponding phrase in the English sentence is highlighted.

 

Inferring the reading from the English translation is an interesting idea, but for more difficult passages I'd actually prefer to be able to check the actual reading/definition quickly. Exploding the sentence into a list of dictionary entries and scrolling down to find the word in question is inconvenient from a UI perspective. A quick lookup also means that you don't always have to rely on the translation, which is more prone to mistakes. Perhaps you skip the machine translation on the first step, add pinyin to the top row (like on the example pictures on your website), and do the machine translation on demand on the sentence tab (click to translate for example). That way you don't have to translate everything at once and only do the extra processing on demand.

 

Comparing this short passage between Chinese Text Analyser (CTA), Mandarin Tools Annotator (MTA), and Smart Chinese Reader (NLPtool):

 

    ‘大笨象西餐厅’在北京的朝阳区,是一家不错的俄国餐馆。刘星的这一身打扮进西入西餐厅还是相当惹眼的,

    马上就到中午吃饭点儿了,还有位置,刘星随意的坐在了窗边,因为这个位置能继续他的看美女大业。

    “先生,请问您需要点什么?”一位长相还不错的服务员走到刘星身边礼貌的问道,并没有因为李冲锋这一身混子的打扮而态度不好。

    “需要美女,有吗?”刘星转头看着对方说道。

 

 

All three segmented the passage the same, except CTA and MTA both missed proper names, where NLP was able to identify them properly.  西餐厅 was segmented 西 餐厅 in MTA, and 西餐 厅 in CTA, NLPtool identified the whole thing together and flagged it as a noun 西餐厅. 

 

So, NLP definitely seems to have better segmentation, and the ability to flag places and names is a standout feature.

 

Taking the example sentences on your website:

日本是亚洲内人口老化最严重的国家

 

Mandarin Tools segmented this correctly (内 人口), Chinese Text Analyser segmented it incorrectly (内人 口), and of course NLP segmented it correctly.

 

黄保中利用职务上的便利为他人谋取利益

NLP was the only one to identify the name.

 

For the bonus round,

 

美女 大小姐

NLP: my beauty eldest-daughter-of-an-affluent-family

Google: My eldest beauty

Youdao: Miss my beauty is big (face-palm)

Bing: My beautiful lady

Properly segmented, but a translation failure all around. This is why you need a popup dictionary for people to check automatic translations, otherwise you WILL inevitably be teaching incorrect words. Even Google didn't select the correct meaning. No matter how good the algorithm is, mistakes will be made, and checking them should be made as easy as possible.

 

On the issue of speed: it's important to keep in mind that Mandarin Tools can segment and annotate millions of characters in a few seconds, enough for an entire book at once. CTA is even faster, and can process 1.5 million characters in half a second on my computer. NLPtool takes about a minute on my system to process just one paragraph (20 sentences) and crashes on longer passages.

 

I'm hoping you can figure out another way to optimize the script, because it does seem like when it comes to word segmentation and POS tagging, it is better than other software. 

 

To summarize:

If you have machine translation, you need an easy way to check the translation (popup dictionary).

I hope there are other ways to optimize the code. It's clear that your software does some really nice language analysis, but it's not just hundreds of times slower than other software, it's thousands, if not tens of thousands of times slower.

 

Cons:

Really slow/resource hog.

no popup dictionary/readings,

can't look-up or read individual words.

 

Pros:

Good segmentation,

totally offline,

decent translation,

POS tagging,

Support for TTS.

proper names and places identified.

 

 

 

I'm wondering where you mined the sentences though? I'm trying to make a sentence dictionary myself, and it's hard finding enough material.

Link to comment
Share on other sites

Perhaps you skip the machine translation on the first step, add pinyin to the top row (like on the example pictures on your website), and do the machine translation on demand on the sentence tab (click to translate for example). That way you don't have to translate everything at once and only do the extra processing on demand.

Very good idea! I appreciate your suggestion.

 

This is why you need a popup dictionary for people to check the automatic translations

I will implement a popup dictionary. One question is about the main text interface of SmartCR. SmartCR's interface is quite different from other Chinese reading software in that each clause occupies its own row and a space is added between words, so everything is already segmented.

 

SmartCR's interface

smartcr17_sentence.JPG

 

The normal interface of Chinese reading tools is as follows:

 
DimSum's interface
dimsum.gif
 
When you point the mouse somewhere on the window, the word in that position is highlighted. 
 
Which one is better? For beginners, elementary or intermediate learners? 
 

I'm wondering where you mined the sentences though? I'm trying to make a sentence dictionary myself, and it's hard finding enough material.

I bought it. You can try http://www.docin.com/p-652959856.html first.

Link to comment
Share on other sites

Although I'm quite biased, I personally think that automatic popups like DimSum are harmful to learners, and don't encourage learning so much as just letting people read word-by-word English.

 

That being said, I think SmartCR's interface is also not the best in that initially both English and Chinese are shown together. Beginner students especially will gravitate towards the English rather than looking at the Chinese. I would initially hide the English and make it display only when the user presses a button. This will let them struggle and think for a bit before checking the answer.

 

This will also greatly reduce the need to perform machine translation for an entire document and you can just do it in real-time sentence by sentence if and when the user clicks on it (Solution B you mentioned above).

 

I wouldn't give up on Step 2 just yet.  Disable Step 3 except for when the user clicks a button for that sentence.  Only if it's still too slow would I then consider dropping Step 2, but Step 2 is a pretty useful feature (probably more so than machine translation).

Link to comment
Share on other sites

Which one is better? For beginners, elementary or intermediate learners?

I think breaking apart the sentences into individual clauses is an interesting idea, I think it's better for beginner-intermediate, but I don't think someone would use one or the other exclusively given the option.

Link to comment
Share on other sites

I wouldn't give up on Step 2 just yet.  Disable Step 3 except for when the user clicks a button for that sentence.  Only if it's still too slow would I then consider dropping Step 2, but Step 2 is a pretty useful feature (probably more so than machine translation).

To imron, your suggestion helps me prioritize and adjust my design. Thank you!

 

I think breaking apart the sentences into individual clauses is an interesting idea, I think it's better for beginner-intermediate, but I don't think someone would use one or the other exclusively given the option.

To Junso1: I am considering providing both kinds of interface and letting users choose which one to open. I suppose a DimSum-like interface would be welcomed by upper intermediate and advanced learners. First, it can accommodate more text. Second, users can segment the text themselves, and when they encounter difficulties, they can move the mouse over the questionable character to see the software's segmentation. Could someone tell me about their reading experiences with this?

Link to comment
Share on other sites

It pained every fiber of my body to pay 2 dollars for that 100,000 sentences .pdf, considering the sentences were obviously mined from other online locations. It was only after I had stayed up 4 hours devising a way to rip the flash file from docin and convert all 3300 pages to JPG, and batch optimize them for OCR that I threw in the towel for lack of sleep. Well, the highest accuracy I could get was about 4 errors per page, which was unacceptable for a 3000 page document. So I caved in and made the purchase. I noticed there are a lot of words not in the corpus. I was previously able to mine about 30,000 sentences and except for some chengyu, most words I couldn't find in my existing sentence bank were also absent in the 100,000 sentence corpus. I think it has to do with the nature of the type of sentences, where certain words are under-represented or absent, and others over-represented. I then pulled up an internet novel (300+ chapters and 1.5 million characters), and was able to find nearly any word I searched for. So, I segmented that instead, unfortunately there is no parallel translation, so it's limited use.

 

Anyway, where was I going with this... Is a 100,000 sentence corpus large enough for natural language processing? Maybe mining sentences from novels or something else would be better.

Link to comment
Share on other sites

A 100,000 sentence corpus provides a good starting point. First, it is large enough to test your algorithms. I started from a 1000 sentence corpus. Second, no single corpus can fulfill one's task. You have to buy or mine from several places. There are mistakes and repetitions in every source, so you have to devise your own tool to clean them up. It is definitely time-consuming work. My suggestion is to make the best use of what you have and keep accumulating. 中英电影台词 (Chinese-English movie lines) is a good search keyword for finding parallel translations on Baidu.
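
As a very small example of the kind of clean-up tool meant here, the sketch below normalizes whitespace in Chinese-English sentence pairs, drops pairs with an empty side, and removes repetitions. The function name and the specific rules are assumptions for illustration; real sources will need many more rules.

```python
def clean_pairs(pairs):
    """Normalize whitespace, drop incomplete pairs, and remove duplicates.

    `pairs` is an iterable of (chinese, english) strings.
    """
    seen = set()
    cleaned = []
    for zh, en in pairs:
        # Assumption: the Chinese side needs no internal spaces at all.
        zh = "".join(zh.split())
        # Collapse runs of whitespace on the English side to single spaces.
        en = " ".join(en.split())
        if not zh or not en:
            continue
        # Treat pairs that differ only in English letter case as duplicates.
        key = (zh, en.lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((zh, en))
    return cleaned
```

Merging material from several sources then amounts to concatenating the lists and running `clean_pairs` once over the result, keeping the first occurrence of each pair.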

Link to comment
Share on other sites

That's a great tip. I found it near impossible to clean that file up because the sentences don't follow any sort of standard; they are haphazardly thrown together and there is so much variation in punctuation and line spacing, not to mention a lot of errors. After hitting it with about 20 regular expressions it's 95% there, but nowhere near good enough to re-order and sort the sentences the way I want. I think I will go for the movie scripts instead. Thanks for your suggestions.

 

Have you considered incorporating a sentence (contextual) dictionary in your software? For example, selecting a word could pull up other example sentences from the corpus. Now that would be an AWESOME feature if any word in the document could quickly reference a database of other sentences.
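
One simple way such a lookup could work, assuming the corpus has already been segmented into words, is an inverted index from each word to the sentences containing it. The names `build_index` and `examples` are invented for this sketch.

```python
from collections import defaultdict


def build_index(segmented_sentences):
    """Map each word to the ids of the sentences that contain it."""
    index = defaultdict(list)
    for sentence_id, words in enumerate(segmented_sentences):
        # Use a set so a word repeated in one sentence is indexed once.
        for word in set(words):
            index[word].append(sentence_id)
    return index


def examples(word, index, segmented_sentences, limit=3):
    """Return up to `limit` example sentences containing `word`."""
    ids = index.get(word, [])[:limit]
    return ["".join(segmented_sentences[i]) for i in ids]
```

The index is built once when the corpus is loaded, so looking up a selected word afterwards is a single dictionary access rather than a scan over every sentence.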

Link to comment
Share on other sites

Yes, I have planned an example sentence search feature for SmartCR, but right now the most important thing for me is to ensure that its basic features serve users well; only then can I go further. Take the current version for instance: although it already has several valuable features such as full text pronunciation and color part of speech notation, users tend to ignore them because of the low processing speed and the interface layout.
 
Speaking of color part of speech notation, I would like to say a little more. The syntax of Chinese is quite different from that of English. You should not pay attention only to the number of known/unknown words. There are many aspects of knowing a Chinese word. Parts of speech (POS) of Chinese words in a sentence are marked with colors in SmartCR. With the color POS notation, you can familiarize yourself with Chinese syntax in a shorter time. It provides a big picture of a sentence. Or as you have said:

It does give you some quick insight into sentence structures 

 

Sentence patterns are easier to recognize this way, and you will become more sensitive to them. Once you learn the basic structures of Chinese sentences, you can make sense of a sentence more easily based on the parsing.
 

Verbs are the headwords in a sentence, and parsing should center on them, so they are marked with red color. If there is only one red word in a sentence, it has a simple “subject + verb + object” structure, e.g., 练习 技能. If there are two red words, one of them may function like a participle in English, e.g., 练习 写作 技能 (I practice writing skills), or belong to a sub-clause e.g. 决定 练习 技能 (I decide to practice skills)

 

A verb-oriented reading skill is effective not only because verbs convey the major meaning in a sentence, but also because the verb vocabulary is much smaller than the noun vocabulary; it is almost a closed set. If you put more effort into verbs and grasp their uses, your reading skill will improve substantially. If you learn the noun 包子, you only know one more word, which is the name of a food. But if you learn 蒸 as a verb, you can make sense of more sentences, for example 馒头、 花卷. Even if you don't know 馒头 or 花卷, you can infer that they must be some kind of food.

 

Interestingly, you can find some Chinese sentences without any red words. These sentences usually have a green end, which represents an adjective. For example, 问题 复杂 (The problem is complex), 头发 (She has long hair), 局势 我们 不利 (The situation is going against us).

 

Place names are in the underlined blue font. You will notice that the location is placed before the verb in Chinese, whereas it appears afterwards in English, e.g., 北京 工作.

Link to comment
Share on other sites

  • 2 months later...

Based on user feedback, we have developed Smart Chinese Reader 1.8, which improves on several aspects including user interface, performance, and robustness. Give it a try. I hope you will enjoy it. Again, your suggestions are always welcome. The download URL is http://www.nlptool.com

What is new in SmartCR 1.8?

1. Increased speed. SmartCR 1.8 can segment a dozen pages of text in a few seconds, which meets the needs of daily reading.

2. No sentence length limit. Long sentences are processed as fast as short sentences.

3. Machine translation on demand. The English translation is initially hidden; it is displayed only when the user clicks a button.

4. An advanced view of the text is added. No single reading mode suits all learners. If you are an intermediate or above Chinese learner, the advanced view may be more suitable for you.

5. Popup dictionary. Want to know the definition of a word? Just click on the word in either the basic view or the advanced view to bring up its dictionary entry. A dictionary on hover was considered. Although it saves users a click, it was not adopted in the end, because users tend to trigger the dictionary by accident when their mouse moves over the text.

6. Statistics and color notation of HSK levels. These indicate the difficulty of a text and let learners plan their studies easily.

7. Resizable window. The window of SmartCR 1.8 can be resized manually, and the next time you start the application it opens at the size and position you last set.

Link to comment
Share on other sites

  • 1 month later...
I posted the above a month ago, but have not received any reply. Has no one tried SmartCR 1.8? I am curious about how it is being used. If you have downloaded the software, please say something about it. To save you time, here is a list of possible results of trying SmartCR 1.8:
 
   Can't download
   Can't install
   Crash while running
   Unstable
   Low speed
   Inconvenience in user interface
   Useless
   Of some use
   Useful 
 
You can just select one or more items from the list. Of course, detailed descriptions of your experience with SmartCR 1.8 will be valuable for improving it and are highly appreciated.  
 
Some screenshots of SmartCR 1.8 are attached:
 
Basic view:
smartcr18_basic.JPG
 
Advanced view:
smartcr18_advanced.JPG 

 

Color notation of HSK levels:

smartcr18_hsk.JPG
Link to comment
Share on other sites

  • 3 weeks later...

Hi, I downloaded your program and am reviewing it now.

 

It installed properly and I have a few preliminary comments.

First, I'm very happy to see that this is not a "cloud-based" program that requires constant connection to the internet. Much of the world has only spotty connection to the internet and thus cloud based systems are useless.

I lived in China for 12 years and experienced very poor internet service - very intermittent and slow. Glad to see the dictionary built-in.

 

If you are going to release this as an open-source application, it might generate some enthusiasm. But as it is, it seems quite simple and seems to offer little that far more advanced software, like T. Sherrill's impressive ChineseToolbox 2013, has not been offering for years. Many programs seem to be created by developers who seem unaware of the work done by others.

SmartReader has a nice feature: it can label parts of speech as "verb", "adjective", "adverb", "noun", or "place name", "person name". This is interesting for developing a clearer understanding of grammar.

I like the way that text I want to read is divided up in the "BASIC" reading panel. Sentences are broken down into sub-clauses that allow the reader to more clearly understand the structure of the sentences. Quite nice.

 

The statistics feature showing the HSK rankings of words is interesting. But it would be useful if you could create lists of words for each HSK level and export them.

Also, more useful to me would be a feature that analyzes the characters in a text and shows the percentage of its unique characters that fall within the top 1500/2000/3000 characters needed for basic literacy. This would be useful for my social science research, and would show learners how close various texts bring them to the full range of characters needed for basic literacy.
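
The character statistic described above could be computed along these lines. This is a sketch under assumptions: `ranked_chars` is a frequency-ordered character list that the user would have to supply (none ships with this example), and only the basic CJK Unified Ideographs block is counted as Chinese characters.

```python
def coverage(text, ranked_chars, cutoffs=(1500, 2000, 3000)):
    """Percentage of the text's unique Chinese characters within each top-N list.

    `ranked_chars` is a sequence of characters ordered from most to least
    frequent (assumed to be supplied by the caller).
    """
    # Count only characters in the basic CJK Unified Ideographs block.
    unique = {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}
    result = {}
    for n in cutoffs:
        top = set(ranked_chars[:n])
        if unique:
            result[n] = 100.0 * len(unique & top) / len(unique)
        else:
            result[n] = 0.0
    return result
```

A text in which 95% of the unique characters fall within the top 1500 would, by this measure, be within reach of a learner who knows those 1500 characters, apart from the remaining 5%.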

Link to comment
Share on other sites
