Adobe OCR Woes - Alternatives? a.k.a. What are you all using for your Chinese OCR needs?

November 27, 2018 at 09:55 PM

Lately, I have been looking into ways to make my work easier and it has been quite frustrating to discover that the paid version of Adobe Acrobat (which thankfully I use for other things than processing my Chinese language documents) is painfully incompetent at running OCR on Chinese text. Often, what I get looks like a casualty of wrong encoding, but in fact it’s Adobe’s honest attempt at recognising the text (!?!?).

Am I just extremely behind the times? Is there an alternative magical moderately priced program that does not involve uploading my clients’ sensitive documents to the internet in order to produce a workable text version of the contents of my PDFs? Does Pleco have some kind of PDF-to-Word converter I didn’t know about? Does Chinese Text Analyser do exactly this work? Translators using CAT tools, have you found that they have been able to handle such a task? (SDL Trados seems to have terrible support for Chinese language docs as far as I can tell.)

I have stopped ripping my hair out, but I don’t know that I can realistically continue to avoid figuring out how to do this and survive in this market. This man-child needs help please.

November 27, 2018 at 10:42 PM

Who knows better than Chinese people which OCR program is best for Chinese? Well, I tried searching in Chinese and it seems that the best program is ABBYY FineReader. There is also the finereaderonline.com website, which is the cloud-based version; if you register with your email you get 10 page credits, so you can check how accurate it is before purchasing.

It also seems that the open-source Tesseract library works well enough with Chinese if you know how to program.

November 28, 2018 at 06:45 AM

I use FineReader, but only to enable searching and tap-to-lookup in Pleco for scanned PDF files. The results are not perfect, but they are worth the time and money.

November 28, 2018 at 09:58 AM

12 hours ago, 陳德聰 said:

Is there an alternative magical moderately priced program that does not involve uploading my clients’ sensitive documents to the internet in order to produce a workable text version of the contents of my PDFs?

If your OS is Windows 10 or Windows 7 Ultimate (the multilingual version) and you have Word (Microsoft Office) 2013 or later, you already have that magic tool. Go to your PDF file, but don't open it > right-click and select 'Open with' > select 'Word' from the list of programs (or select 'Other' and search for Word).

Let Word do its job (you may have to 'allow editing' and click on 'pdf files' if Word gives you a list of possible conversions to choose from).

If you're using an earlier version of Windows 10, you may need to download the Chinese language packages from Optional Windows Features and include OCR and 'rare characters' in the options. Ideally, use the latest version of Windows 10 (v. 1803, soon-ish to be updated to v. 1809) with Office 2016 or Office 365, which are much easier and faster for this kind of job.

I can't say I have tested this extensively or in depth, but I have been using this method for years and have never noticed any problem with either simplified or traditional characters.

For some PDF files where Adobe Reader can't figure out the font, I use Kingsoft (金山) PDF. There's also a free full Kingsoft Office suite that includes PDF-to-Word conversion. These programs may have vulnerabilities, so make sure your devices are fully patched (I haven't experienced any problems with the PDF reader).

http://www.kingsoftstore.com/index.

(Edited to add links to Kingsoft)

November 28, 2018 at 10:03 AM

It very much depends on what you're working with. Something that was originally a Word document, has been saved as a PDF, and for some arcane reason the Word document is no longer available: not too bad. Scans of handwritten medical notes in PDF format: good luck.

Thankfully it's not something I have to deal with very often. A couple of times in the past I have simply paid someone to type the damn things out for me.

I also suspect OCR is a bit like voice recognition now - for the best results, you need to be accessing cloud servers.

November 28, 2018 at 11:36 AM

I hear good things about ABBYY FineReader, although none of the people who say the good things use it for Chinese.

The last time I was asked to translate a scanned Chinese periodical article, I promised the client a 10% discount (off a rate inflated by the same amount) if they could get it to me as flat text. They tried but couldn't. I used half that amount to hire a Chinese student to type the article up for me and fed that flat text to the CAT.

I use MemoQ and haven't tried feeding it PDFs, but for regular documents it works all right for Chinese, as far as I have used it. (Most of my translating from Chinese is literary, and I don't use the CAT for that.)

November 28, 2018 at 01:16 PM

14 hours ago, fabiothebest said:

It also seems that the open-source Tesseract library works well enough with Chinese if you know how to program.

I use Tesseract often to OCR English PDFs. I happen to use a front-end for it called OCRmyPDF that has a command line interface, but I'm sure you can find a more user-friendly open-source implementation. Programming skills shouldn't be necessary.

I was curious to see how well it does with Chinese text, so I fed it a page from a low-quality scan of a textbook. The results weren't perfect, but they were pretty good for a free piece of software!

Screenshot of Input PDF:

OCRmyPDF command:

ocrmypdf --sidecar -l eng+chi_sim testpage.pdf testpage_ocr.pdf

Output Text File:

参考译文

老师 : 那是谁的衬衫 ?

老师 : 戴夫，这是你的衬衫吗 ?
戴夫 : 不，先生。这不是我的衬衫。

戴夫 : 这是我的衬衫。我的衬衫是蓝色的。

老师 : 这件衬衫是蒂姆的四?
戴天 : 也许是，先生。蒂姆的衬衫是白色的。

老师 : 蒂姆
蒂姆 : 什么事，先生 ?

老师 : 这是你的衬衫吗 ?
ay}: 是的 ,先生。

老师 : 给你。接着 !
蒂姆 : 谢谢您，先生。
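
For anyone who wants to batch this from a script, here is a minimal Python sketch that wraps the same command line. It assumes `ocrmypdf` is on your PATH and that Tesseract's `chi_sim` language data is installed. The function names and the explicit sidecar file name are my own choices for illustration, not anything OCRmyPDF prescribes.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(input_pdf, output_pdf, languages=("eng", "chi_sim"), sidecar=None):
    """Build the argument list for an ocrmypdf call like the one above."""
    # -l takes Tesseract language codes joined with "+"
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    if sidecar is not None:
        # --sidecar writes the recognised text to a plain-text file
        cmd += ["--sidecar", str(sidecar)]
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


def run_ocr(input_pdf, output_pdf, sidecar):
    """Run ocrmypdf locally and return the recognised text (hypothetical helper)."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    cmd = build_ocrmypdf_command(input_pdf, output_pdf, sidecar=sidecar)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if OCR fails
    return Path(sidecar).read_text(encoding="utf-8")


# Example call (file names taken from the command above; sidecar name is assumed):
# text = run_ocr("testpage.pdf", "testpage_ocr.pdf", "testpage_ocr.txt")
```

Everything runs on your own machine, so nothing gets uploaded anywhere.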

November 28, 2018 at 01:28 PM

Yeah, I didn't search, but I thought there might be a front-end for it. Anyway, I'm good enough with Python (there is a Python implementation); otherwise it's in C++. I just thought not everyone is. The CLI option seems easy to use. I just checked and there are many GUIs available, as well as versions for Android and iOS.
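
In case it helps, a rough sketch of the Python route. It assumes the `pytesseract` wrapper and Pillow are installed alongside Tesseract itself with its Chinese language data (`chi_sim` for simplified, `chi_tra` for traditional). The helper names and the image file name are made up for the example.

```python
def tesseract_lang(*codes):
    """Join Tesseract language codes into the "a+b" form the lang option expects."""
    return "+".join(codes)


def ocr_image(image_path, lang="chi_sim+eng"):
    """OCR a single page image locally and return the text (hypothetical helper)."""
    # Imported here so the module still loads where the packages are missing
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


# Example call (assumed file name):
# print(ocr_image("testpage.png", lang=tesseract_lang("chi_tra", "eng")))
```

This works on images, so a scanned PDF would first need its pages exported as images.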

November 30, 2018 at 02:11 AM

Last time I checked (2016), Adobe's solution was pretty dumb. I'll look into whipping something together after NIPS.

October 27, 2019 at 09:23 AM

One alternative is OneNote (2016, for example). Unedited results for the above text:

参考译文
老师 · 那是谁的衬衫 ?
戴夫，这是你的衬衫吗 ?
老师
戴夫
不，先生。这不是我的衬衫。
戴夫
这是我的衬衫，我的衬衫是蓝色的
老帅
这件衬衫是蒂姆的吗？
戴人
也许是，先生。蒂姆的衬衫是白色的。
老师
蒂姆！
蒂姆，什么事，先生 ?
老师
这是你的衬衫吗 ?
蒂
是的，先生 “
老帅
给你。接着！
蒂的
谢谢您，先生。
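
If you want a rough number for comparing outputs like these, one option is a character-level similarity ratio from Python's standard `difflib`. This is only a sketch: the reference line below is my own hand-corrected guess at what the page says, not a verified transcription, and the function names are invented for the example.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Fold full-width punctuation to ASCII forms and drop all whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if not ch.isspace())


def similarity(ocr_text, reference):
    """Return a 0..1 similarity ratio between OCR output and a reference."""
    return SequenceMatcher(None, normalize(ocr_text), normalize(reference)).ratio()


# One line of the Tesseract output above against an assumed corrected line
tesseract_line = "老师 : 这件衬衫是蒂姆的四?"
assumed_reference = "老师：这件衬衫是蒂姆的吗？"
print(round(similarity(tesseract_line, assumed_reference), 3))
```

The normalisation step matters because the two programs disagree on spacing and on full-width versus half-width punctuation, which would otherwise count as errors.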


Posters, in thread order: 陳德聰, fabiothebest, wibr, Luxi, roddy, Lu, 大块头, fabiothebest, Hofmann, nnt
