Chinese-forums.com
陳德聰

Adobe OCR Woes - Alternatives? a.k.a. What are you all using for your Chinese OCR needs?


陳德聰

Lately, I have been looking into ways to make my work easier and it has been quite frustrating to discover that the paid version of Adobe Acrobat (which thankfully I use for other things than processing my Chinese language documents) is painfully incompetent at running OCR on Chinese text. Often, what I get looks like a casualty of wrong encoding, but in fact it’s Adobe’s honest attempt at recognising the text (!?!?).

 

Am I just extremely behind the times? Is there an alternative magical moderately priced program that does not involve uploading my clients’ sensitive documents to the internet in order to produce a workable text version of the contents of my PDFs? Does Pleco have some kind of PDF-to-Word converter I didn’t know about? Does Chinese Text Analyser do exactly this work? Translators using CAT tools, have you found that they have been able to handle such a task? (SDL Trados seems to have terrible support for Chinese language docs as far as I can tell.)

 

I have stopped ripping my hair out, but I don’t know that I can realistically continue to avoid figuring out how to do this and survive in this market. This man-child needs help please.


fabiothebest

Who would know better than Chinese speakers which OCR program is best for Chinese? I tried searching in Chinese, and the consensus seems to be that the best program is ABBYY FineReader. There is also a cloud-based version at finereaderonline.com; if you register with your email you get 10 page credits, so you can check how accurate it is before purchasing.

 

It also seems that the open-source Tesseract library works well enough with Chinese, if you know how to program.

wibr

I use FineReader, but only to enable searching and tap-to-lookup in Pleco for scanned pdf files. The results are not perfect but worth the time and money.

Luxi
12 hours ago, 陳德聰 said:

Is there an alternative magical moderately priced program that does not involve uploading my clients’ sensitive documents to the internet in order to produce a workable text version of the contents of my PDFs?

 

If your OS is Windows 10 or Windows 7 Ultimate (the multilingual version) and you have Word (Microsoft Office) 2013 or later, you already have that magic tool. Go to your PDF file but don't open it > right-click and select 'Open with' > select 'Word' from the list of programs (or select 'Other' and search for Word).

Let Word do its job (you may have to 'allow editing' and click on 'pdf files' if Word gives you a list of possible conversions to choose from)

 

If you're using an earlier version of Windows 10, you may need to download the Chinese language packages from Optional Windows Features and include OCR and 'rare characters' in the options. Ideally, use the latest version of Windows 10 (v. 1803, soon-ish to be updated to v. 1809) and Office 2016 or Office 365; they are much easier and faster at this kind of job.

 

Can't say I have tested this extensively and in-depth, but have been using this method for years and never noticed any problem with either simplified or traditional characters.  

 

For some PDF files where Adobe Reader can't figure out the font, I use Kingsoft (金山) PDF (there's also a free, full Kingsoft Office suite that includes PDF-to-Word conversion). These may have vulnerabilities, though, so make sure your devices are fully patched (I haven't experienced any problems with the PDF reader).

http://www.kingsoftstore.com/index.

 

(Edited to add links to Kingsoft)

 

 

roddy

It very much depends what you're working with. Something that was originally a Word document, has been saved as a PDF, and for some arcane reason the Word document is no longer available - not too bad. Scans of handwritten medical notes in PDF format - good luck.

 

Thankfully it's not something I have to deal with very often. A couple of times in the past I have simply paid someone to type the damn things out for me. 

 

I also suspect OCR is a bit like voice recognition now - for the best results, you need to be accessing cloud servers. 

Lu

I hear good things about ABBYY FineReader, although none of the people who say the good things use it for Chinese.

 

The last time I was asked to translate a scanned Chinese periodical article, I promised the client a 10% discount (off a rate inflated by the same amount) if they could get it to me as a flat text. They tried but couldn't. I used half that amount to hire a Chinese student to type the article up for me and fed that flat text to the CAT.

 

I use MemoQ. I haven't tried feeding it PDFs, but for regular documents it works all right for Chinese, as far as I have used it. (Most of my translating from Chinese is literary, and I don't use the CAT for that.)

大块头
14 hours ago, fabiothebest said:

It also seems that the open-source Tesseract library works well enough with Chinese, if you know how to program.

 

I use Tesseract often to OCR English PDFs. I happen to use a front-end for it called OCRmyPDF that has a command line interface, but I'm sure you can find a more user-friendly open-source implementation. Programming skills shouldn't be necessary.

 

I was curious to see how well it does with Chinese text, so I fed it a page from a low-quality scan of a textbook. The results weren't perfect, but they were pretty good for a free piece of software!

 

Screenshot of Input PDF: (image not included)

 

OCRmyPDF command:

ocrmypdf --sidecar -l eng+chi_sim testpage.pdf testpage_ocr.pdf
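If you have a whole folder of PDFs rather than one test page, the same command can be wrapped in a few lines of Python. This is only a rough sketch: the function names (`build_command`, `ocr_folder`) and the output naming scheme are made up for illustration, and it assumes `ocrmypdf` and the Tesseract language data for `eng` and `chi_sim` are already installed locally. It passes an explicit file name to `--sidecar` so each PDF gets its own text file.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Sequence


def build_command(pdf: Path, out_dir: Path,
                  languages: Sequence[str] = ("eng", "chi_sim")) -> List[str]:
    """Return the ocrmypdf argument list for one input PDF (hypothetical helper)."""
    out_pdf = out_dir / f"{pdf.stem}_ocr.pdf"   # searchable PDF
    sidecar = out_dir / f"{pdf.stem}_ocr.txt"   # plain-text copy of the OCR result
    return [
        "ocrmypdf",
        "--sidecar", str(sidecar),
        "-l", "+".join(languages),              # e.g. "eng+chi_sim", as in the command above
        str(pdf),
        str(out_pdf),
    ]


def ocr_folder(in_dir: Path, out_dir: Path) -> None:
    """Run ocrmypdf locally on every *.pdf in in_dir (hypothetical helper)."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(in_dir.glob("*.pdf")):
        # check=True stops the batch at the first file ocrmypdf fails on
        subprocess.run(build_command(pdf, out_dir), check=True)
```

Calling `ocr_folder(Path("scans"), Path("ocr_output"))` would then process everything in a `scans` folder (both folder names are placeholders). Everything runs on your own machine, so nothing gets uploaded.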

 

Output Text File:

参考 译文

老师 : 那 是 谁 的 衬衫 ?

老师 : 戴 夫 , 这 是 你 的 衬衫 吗 ?
戴 夫 : 不 , 先 生 。 这 不 是 我 的 衬衫 。

戴 夫 : 这 是 我 的 衬衫 。 我 的 衬衫 是 蓝 色 的 。

老师 : 这 件 衬衫 是 蒂 姆 的 四?
戴天 : 也 许 是 , 先 生 。 蒂 姆 的 衬衫 是 白色 的 。

老师 : 蒂 姆
蒂 姆 : 什么 事 , 先 生 ?

老师 : 这 是 你 的 衬衫 吗 ?
ay}: 是 的 ,先生 。

老师 : 给 你 。 接 着 !
蒂 姆 : 谢谢 您 , 先 生 。
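The output above has stray spaces between characters, which gets in the way of searching or pasting into a CAT tool. A small post-processing step can strip them. The sketch below is a heuristic, not anything OCRmyPDF or Tesseract provides: the function name `collapse_cjk_spaces` and the character ranges are my own rough choices, and it only removes spaces or tabs that sit between two Chinese characters or punctuation marks, so ordinary English text is left alone. It won't fix misrecognised characters.

```python
import re

# Rough, hand-picked set: CJK punctuation, common Han characters,
# full-width forms, plus a few ASCII punctuation marks that OCR output
# often mixes in. This is an assumption, not an exhaustive definition.
_CJK_OR_PUNCT = r"\u3001-\u303f\u4e00-\u9fff\uff00-\uffef:,?!;"

# Spaces or tabs with a CJK/punctuation character on both sides.
_GAP = re.compile(
    r"(?<=[" + _CJK_OR_PUNCT + r"])[ \t]+(?=[" + _CJK_OR_PUNCT + r"])"
)


def collapse_cjk_spaces(text: str) -> str:
    """Remove spaces between CJK characters/punctuation; keep everything else."""
    return _GAP.sub("", text)


sample = "老师 : 那 是 谁 的 衬衫 ?"
print(collapse_cjk_spaces(sample))  # 老师:那是谁的衬衫?
```

Line breaks are kept, so the dialogue layout survives. Spaces next to Latin letters or digits are untouched, which means mixed Chinese and English lines may still need a manual pass.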


 

fabiothebest

Yeah, I didn't search, but I thought there might be a front-end for it. Anyway, I'm good enough with Python (there is a Python implementation); otherwise it's in C++. I just thought not everyone is. The CLI option seems easy to use. I just checked, and there are many GUIs available, as well as versions for Android and iOS.

Hofmann

Last time I checked (2016), Adobe's solution was pretty dumb. I'll look into whipping something together after NIPS.
