Chinese-Forums

Chinese OCR


techie


I've tried the free OCR program called COCR2 and it works pretty well though it's quite tedious to do. Basically, you have to open a BMP or DIB file and use a small rectangle to capture words in the image file. Thus, you have to scan every single character.

Next, I'd like to try Readiris Pro 11 to scan a book I have, "Why We Want You to Be Rich" by Donald Trump and Robert Kiyosaki, in its Traditional Chinese translation. It's been sitting there for a year now because my vocabulary never got to the point where I could read it without problems. Hopefully OCR can solve this dilemma of mine. :mrgreen:

Edited by ABCinChina

I got the Readiris Pro 11 Multilanguage version. I think it does almost every language on Earth, since even languages like Zulu are included.

Edit: It seems Readiris only converts very clear image files into text correctly. If the font is not recognized or the picture is slightly blurry, many words come out wrong. Looks like Chinese OCR software is not all it's cracked up to be...

Edited by ABCinChina

  • 2 months later...

This past summer Abbyy came out with software that will OCR Chinese. A free 14-day trial can be downloaded here:

http://finereader.abbyy.com/?from=google_finereader&gclid=CI7y0pCe-ZcCFQwcegodqz0JCw

I am interested to hear other people's experiences of how well it works.

The software is big (245MB) and expensive (US$400), although computer hobbyists can certainly find a way around the cost. It is very easy to use: just drag in a picture of Chinese text and the software will OCR the page. It can also batch convert a folder of multiple pages.

I went to a small copy shop in China and they had never heard of this kind of software. I guess labor is so cheap that they don't mind retyping a whole document. They are also good at using Photoshop to edit text in jpg files. I don't know if there are good made-in-China versions of OCR software. Abbyy is a Russian company.


  • 3 months later...
  • 10 months later...

This is a free but slightly tedious way to *study* (not good for scanning or saving) documents in simplified Chinese, mainly if you don't recognize the characters or know the meaning of the words. Once you get a feel for the system, you can save a lot of time looking up individual characters. It does require installing a few new pieces of software and a little technical knowledge (installing software, using a digital camera / scanner, doing small photo edits, etc.), but if you can't do it, I'm sure you can find a tech buddy. Set-up time: 2-3 hours.

1. Download and install the COCR2 software here: http://users.belgacom.net/chardic/cocr2.html#tutorial

2. Download and install Chinese Perapera-Kun Chinese Popup Translator for Firefox (and Firefox if you don't already have it). It's found here: https://addons.mozilla.org/en-US/firefox/addon/3349. I have found that this free mouseover software works better than some other popup translators because the database is larger and includes idioms (including ones that the MDBG Reader did not have). But it is free and has its limitations. You'll have to restart Firefox and turn on Perapera-Kun, but after that, you should be good to go.

3. Download and install a photo editing / resizing program. I am using a freeware program called FastStone Photo Resizer which has a batch convert feature (converts many files at once) and Microsoft Office Picture Manager for its auto correct and rotate functions. There are many other free packages available.

4. Scan or take a digital photo of the paper or page. I use an 8MP digital camera set on its highest resolution. I suppose lower resolutions would work as well. Just find a flat surface with decent lighting and use the "macro" feature of the camera for taking close-up images. I typically do one 8 1/2 x 11" page at a time. Upload these images to your computer.

5. Use photo / imaging software to edit the files (I only use auto correct to brighten and rotate to make it upright). Then convert them to .bmp files. COCR2 will only read .bmp files.

6. Open up COCR2 and load the image from the "File" menu. There will be a box that you want to enclose around the character. You must include the entire character and only one character; white space around the character is fine. The enclosure box is supposed to change sizes, but it doesn't seem to work on my system, so I have to edit the image to fit the box. This will be different for each user. In FastStone Photo Resizer, I use "Advanced Options" -- "Resize" to re-size the picture to 55%. Finding the right fit involves modifying the percentage and re-opening images in COCR2 until the characters fit the box. Once you have figured out that percentage, and as long as you keep taking the digital pictures (or scans) in the same way at the same resolution, you shouldn't have to figure it out again.

7. Once you enclose the box around the character, you can select 0-9 to pick the correct character. It will show up in the right pane. If you have a pop-up / mouseover software, it will recognize the character immediately in the right pane. What I do is copy the characters / words into a Word document and then save it as an .htm / .html file. Then I open it up in Firefox and use Chinese Perapera-Kun Chinese Popup Translator to read it. If you make changes to your Word document or add new characters, you can easily save the document and then "reload" the page in Firefox.
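The Word-to-.htm round trip in step 7 can also be done with a short script. This is only a sketch, assuming the characters copied out of COCR2 have been pasted into a plain text string or file; the function names `text_to_html` and `save_html` are made up for illustration and are not part of COCR2 or Perapera-Kun. It writes a UTF-8 page that Firefox can open, so the popup translator can work on it.

```python
import html
from pathlib import Path


def text_to_html(text: str, title: str = "COCR2 output") -> str:
    """Wrap plain text in a minimal UTF-8 HTML page, one <p> per non-empty line."""
    paragraphs = [
        f"<p>{html.escape(line)}</p>"
        for line in text.splitlines()
        if line.strip()  # skip blank lines
    ]
    return (
        "<!DOCTYPE html>\n"
        '<html lang="zh">\n'
        f'<head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        "<body>\n" + "\n".join(paragraphs) + "\n</body>\n</html>\n"
    )


def save_html(text: str, out_path: str) -> None:
    """Write the page to disk as UTF-8 so the browser decodes the characters correctly."""
    Path(out_path).write_text(text_to_html(text), encoding="utf-8")
```

For example, `save_html("你好\n谢谢", "cocr2_output.html")` would write a two-paragraph page; re-run it and reload the page in Firefox whenever you add characters.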

This method should also make it fairly easy to copy and paste new characters into a software flash card system such as Mnemosyne or Anki. Thanks go out to those who have developed these free software packages that we often take for granted, and for the tips from previous posts.
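For flash cards, copying and pasting can be replaced by a bulk import. As far as I know, both Anki and Mnemosyne can import tab-separated text files, so a rough sketch like the one below may save some clicking. The helper name `cards_to_tsv` and the sample cards are made up for illustration.

```python
import csv
import io


def cards_to_tsv(cards):
    """Turn (front, back) pairs into tab-separated text, one card per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for front, back in cards:
        writer.writerow([front, back])
    return buf.getvalue()


# Hypothetical sample data: character on the front, meaning on the back.
SAMPLE_CARDS = [("你好", "hello"), ("谢谢", "thank you")]
```

Save the returned text to a file with UTF-8 encoding and then use the import function of your flash card program.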

Blessings in Jesus.


  • 2 weeks later...

Anyone know of any software that you can use on a cellphone?

Kind of like Google Goggles?

http://www.google.com/mobile/goggles/#landmark

E.g. you take a picture on your phone of some text and it translates it for you.

Apparently Google's Tesseract OCR now supports Chinese

http://recaptcha.net/learnmore.html

http://www.ocrterminal.com/

Basically, with my Nokia N900 I could write a small script in Python to take a picture, send it to a server and get the text back, or use Tesseract directly on the phone

http://danielwould.wordpress.com/2010/03/07/optical-character-recognition-on-the-n900/
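A minimal sketch of the "use Tesseract directly" option, assuming the `tesseract` command-line program and its Chinese language data (`chi_sim` for simplified, `chi_tra` for traditional) are installed. The function names are made up for illustration, and using `stdout` as the output name assumes a reasonably recent Tesseract version.

```python
import subprocess


def build_tesseract_command(image_path, lang="chi_sim"):
    """Build the command line: tesseract <image> stdout -l <lang>."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="chi_sim"):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        check=True,  # raise CalledProcessError if tesseract reports a failure
    )
    return result.stdout.decode("utf-8")
```

Something like `print(ocr_image("photo.jpg", lang="chi_tra"))` would then print the text of a photo of traditional Chinese; the file name here is only a placeholder.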


Hi all!

I decided to purchase Presto! Danching but, after sending the order, it was cancelled by Newsoft: they told me that they do not offer international shipping (I live in Italy).

Is there anybody willing to sell me a licence with the software?

Kind Regards,

罗吉 (Gilberto)


  • 8 months later...

I've used Danching in the past. Spent way too much time editing mistakes to like it. Abbyy Finereader 10 has got to be the best OCR software I have used for Chinese Traditional. With Image PDF files at 300dpi, I am converting at about 99% accuracy. I highly recommend this software. The only thing I think that would make it better would be some way to import your own dictionary terms.


  • 2 months later...

I got ABBYY FineReader 10 and it is kinda performing well. It does both traditional and simplified mostly reliably. I want to use it to pass through a scanned PDF to add the text layer.

One small problem: by itself it cannot handle mixed Chinese-English text (a textbook). Still, you can manually set the language for each scan area, so it is not that bad, just a bit of manual work.

Another problem that I have not managed to resolve is how to make it scan pinyin and keep the tone marks. If I declare pinyin text as English, FineReader removes all accents; if I declare it as, say, Spanish, it keeps some accents but not others. It is too clever on the one hand, but too dumb to know that pinyin is part of Chinese.

I wonder if there is a language that uses all the markings used by pinyin, so I can tell FineReader to use it for those text blocks?

In this (combined) thread 5 years ago or so, someone suggests "OC Software Sunmipage Chinese/English Bilingual OCR Engine" to solve this problem, however these guys look to be out of business for now (at least I fail to find a live link to them). And FineReader otherwise does a good job. Any other ideas?

Edited: After poking around for some time, it seems it is possible to handle this with FineReader 10 (I do not know if it is possible with earlier versions). For example, I have a textbook whose text contains Simplified+Traditional characters, English text and pinyin. Solution:

1) define a new language based on English, call it for example English Pinyin, and add the special characters ǎěǐǒǔāēīōūáéíóúàèìòù to the character set.

2) define 3 languages for the document: Chinese Simplified, English and English Pinyin. Chinese Simplified seems to handle traditional characters correctly as well

3) one last twist - the characters ǎǐǔ are not recognized by FineReader even after the manipulation above. I even tried "training", but nothing helped; FineReader kept ignoring these 3 characters. So you really need to use similar-looking characters. Replace:

ǎ U01CE with U0103

ǐ U01D0 with U012D

ǔ U01D4 with U016D

That leaves us with a bunch of pretty rare combinations - ǚ U01DA, ǖ U01D6, ǘ U01D8, ǜ U01DC. The first, ǚ, also did not get recognized in my test, or rather was recognized as (the replacement for) ǔ; for the rest I have not seen any occurrences in my text so far.
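Code points like these are easy to mistype, so here is a small sketch for checking them with Python's standard `unicodedata` module. It builds each tone-marked vowel from the base vowel plus a combining accent and prints its code point. The helper names `toned` and `code_point` are made up for illustration.

```python
import unicodedata

# Combining accents for the four pinyin tones.
TONE_MARKS = {
    1: "\u0304",  # macron (first tone)
    2: "\u0301",  # acute (second tone)
    3: "\u030C",  # caron (third tone)
    4: "\u0300",  # grave (fourth tone)
}


def toned(vowel, tone):
    """Compose a vowel and a tone mark into one precomposed character (NFC)."""
    return unicodedata.normalize("NFC", vowel + TONE_MARKS[tone])


def code_point(ch):
    """Format a single character as U+XXXX."""
    return f"U+{ord(ch):04X}"


if __name__ == "__main__":
    for vowel in "aeiouü":
        for tone in (1, 2, 3, 4):
            ch = toned(vowel, tone)
            print(ch, code_point(ch), unicodedata.name(ch))
```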

I imagine that I can use the pattern training and map ǚ to ũ, or maybe leave them all without tone marks and map to ü. In the end, OCRing involves some manual work anyway, so adding these pretty rare characters during editing seems to be a reasonable tradeoff.

Edit 2: I ended up editing the text and typing in the missing tones for ü. You have to edit it anyway, as FineReader does not always manage to get the accents/tone marks right, and the missing accents for ü are the least of the worries. Finally, as the last step, you can do a mass replace of all the look-alikes of ǎǐǔ with the real characters.
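The final mass replace can be done in any editor, or with a few lines of Python once the text is exported from FineReader. This is just a sketch, assuming the exported text contains no genuine ă, ĭ or ŭ (for example from Romanian or Esperanto) that should be left alone; the function name `restore_third_tones` is made up.

```python
import unicodedata

# Breve look-alikes used during OCR -> real third-tone pinyin vowels.
LOOKALIKE_TO_PINYIN = str.maketrans({
    "\u0103": "\u01CE",  # ă -> ǎ
    "\u012D": "\u01D0",  # ĭ -> ǐ
    "\u016D": "\u01D4",  # ŭ -> ǔ
})


def restore_third_tones(text):
    """Replace the breve look-alikes with the caron vowels used in pinyin."""
    # Normalize first so "a" + combining breve is also caught.
    return unicodedata.normalize("NFC", text).translate(LOOKALIKE_TO_PINYIN)
```

Any ŭ that really stood for ǚ in the original will come out as ǔ, so those still need fixing by hand.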


  • 2 years later...

Hi everybody! I'm a complete newbie, so first of all I wanna apologize if I'm posting in the wrong place; cut to the chase:

Issue 1: I have some pdf files containing only pinyin, and tried to OCR them; I've had no difficulty at all with the basic Latin alphabet, except for the following set of characters { o ā ɑ̄ ē ī ō ū ǖ Ā Ē Ī Ō Ū Ǖ á ɑ́ é í ó ú ǘ Á É Í Ó Ú Ǘ ǎ ɑ̌ ě ǐ ǒ ǔ ǚ Ǎ Ě Ǐ Ǒ Ǔ Ǚ à ɑ̀ è ì ò ù ǜ À È Ì Ò Ù Ǜ a ɑ e i o u ü A E I O U Ü }, which are not recognized by Abbyy, Acrobat, Tesseract, etc. I've tried training them, using combinations of different languages, and a million other things like asking in dozens of forums, but no luck.

Issue 2: I also have some resources in true-pdf format with those damn embedded font subsets, and when I try to copy and paste to rearrange the layout, the file becomes unmanageable because the text gets completely illegible. I've installed most of the fonts that I didn't have on my pc, and also the Pitstop plugin, but cannot find a solution, for example substituting a different font for all the characters in the file that use a certain embedded font subset while keeping the original character shapes.

As you can see, issue 1 and issue 2 are related inasmuch as solving issue 1 would also put an end to issue 2.

So I hope to hear good news soon.

Thanks in advance

