Chinese-Forums
Converting Chinese PDFs to HTML


colinuk

I have a very long PDF document written in Simplified Chinese that I would like to study. I can't use Kingsoft's Powerword with it on Windows, nor can I use CEDICT / Pera-Kun on my Mac to look up words, because it is a PDF. So I wanted to convert the PDF to HTML or Word format so that I could use one of these methods. However, when I try to convert the file using the export feature in Adobe Acrobat Professional 8 (or its 'Save As' function, for that matter), the resulting HTML or Word document shows a really weird character set; it's all gobbledygook, basically. I can't get it to 'Save As' or export to HTML or Word and preserve the Chinese characters.

Would anyone know what is going wrong here? Is there anything I should be tweaking to make sure that the characters transfer correctly during export?

Any help would be appreciated.

Cheers

Colin


It's hard to say, so I'm taking blind guesses.

Have you tried all possible character sets on the resulting HTML file? UTF-8, UTF-16, GBK, GB2312, Big-5, etc?
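If you'd rather not click through browser menus, here is a rough Python sketch of the same "try every character set" idea. The function names and the sample string are made up for illustration, and it only helps if the exported file really is Chinese text in one of these encodings; it can't rescue output that was never proper text in the first place.

```python
# Candidate encodings: the ones suggested above, plus GB18030.
CANDIDATES = ["utf-8", "utf-16", "gbk", "gb2312", "big5", "gb18030"]


def cjk_ratio(text):
    """Fraction of characters that fall in the CJK Unified Ideographs block."""
    if not text:
        return 0.0
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(text)


def rank_encodings(data, candidates=CANDIDATES):
    """Decode the bytes with each codec and rank by how 'Chinese' the result looks."""
    results = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding, skip it
        results.append((enc, cjk_ratio(text)))
    # Highest share of CJK characters first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Stand-in for the exported file's bytes (hypothetical sample, encoded as GBK).
sample = "哈利波特与魔法石".encode("gbk")
for enc, ratio in rank_encodings(sample):
    print(enc, round(ratio, 2))
```

For a real file you would read the bytes with `open(path, "rb").read()` instead of using the sample. If no encoding scores well, the problem is probably in the PDF itself rather than in the choice of character set.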

Can you cut and paste characters from the PDF into other programs (like Word)? Which encoding do they end up in?

Could you try cutting and pasting excerpts out manually if the file is not too long?


Hi there

Thanks for the thoughts. When I export to HTML from Adobe Acrobat Professional, it gives several options in the settings to choose from: UTF-8, UTF-16, UCS-4, ISO-Latin-1, HTML / ASCII, Use mapping table default. I have tried all of them to create the HTML document. When I then open the HTML document in a web browser, the characters are all just a mess. I then change the view settings in the browser, going through the full range of Simplified Chinese character encodings that are available (e.g. ISO-2022-CN, HZ, GB18030, etc.). None of them display the characters correctly, as they appeared in the original document.

Likewise, if I export to a Word document, there are no encoding settings to choose from when exporting. When I open the exported Word document, I go through every Chinese font on my list to try to make the characters display properly, but nothing has worked so far.

Is this the kind of thing that you are talking about?

Am really at a bit of a loss on this one. I don't know enough about PDF documents to know what is going on in the background really.

Cheers

Colin


Yep 'tis the boy wizard himself. :lol: Maybe his magic is stopping me from exporting it to HTML!!

There are a lot of really bad translations out there, and many are not complete. I have done quite a lot of searching already. The one I have is faithful to the English version, well translated and complete, so I would like to stick with it, if I could only get it to export...... :help


Fair enough, but I'd be surprised if this isn't online already.

There's an Adobe email address which accepts PDF attachments and sends back plain text, but it says it works for 'English and most West European languages'. I've sent the single page in just in case, but it hasn't come back yet.


"It's using an embedded subset font. It's quite possibly coded in order of appearance and only suitable for printing or OCR."

David, thanks for that. Does that mean there are no tweaks one could make in Adobe Acrobat Professional, or anything else, to export the Chinese to another document format?

Roddy, I didn't know Adobe had such a service. I'll try to look into that too. Let me know if you get any results from your email though. Thanks.

Colin
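To make the embedded subset font point quoted above a bit more concrete, here is a toy model in Python. It is not real PDF parsing: the function names and the sample text are invented, and it assumes the simplest case, where glyph codes are handed out 1, 2, 3... in order of first appearance. In a real PDF the mapping back to characters, where one exists, would live in something like a ToUnicode table.

```python
def build_subset(text):
    """Give each distinct character a glyph code in order of first appearance."""
    glyph_codes = {}
    for ch in text:
        if ch not in glyph_codes:
            glyph_codes[ch] = len(glyph_codes) + 1  # codes 1, 2, 3, ...
    return glyph_codes


def encode_with_subset(text, glyph_codes):
    """What the page content stores: glyph codes, not Unicode characters."""
    return [glyph_codes[ch] for ch in text]


def naive_extract(codes):
    """What a converter produces if it treats glyph codes as code points."""
    return "".join(chr(c) for c in codes)


def extract_with_map(codes, glyph_codes):
    """Recovery is only possible when the code-to-character table is available."""
    reverse = {code: ch for ch, code in glyph_codes.items()}
    return "".join(reverse[c] for c in codes)


text = "妈妈骂马吗"  # hypothetical page text
subset = build_subset(text)
codes = encode_with_subset(text, subset)

print(codes)                             # small integers, unrelated to Unicode
print(repr(naive_extract(codes)))        # control characters, i.e. gobbledygook
print(extract_with_map(codes, subset))   # original text, given the table
```

The page still prints perfectly because the font knows which shape belongs to which code. Without the table, no choice of character set in the export dialog or the browser can turn those codes back into Chinese, which would fit what Colin is seeing.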


See here. It hasn't come back though, so I assume either it doesn't work, or our attempts to feed it Chinese have broken it. Actually, the FAQ does say:

Languages requiring double-byte characters, such as Japanese, Chinese, Arabic, and Hebrew are not supported.


Here is a pdf to word converter and a pdf to text converter. Both claim to support Chinese simplified & traditional characters. (Their pdf to html converter doesn't mention Chinese character support.) I don't know if it will work, but it's free to try. Let us know how it comes out if you try it.

There are also a lot of other pdf to html converters that you can try for free.


My impression was that CID Identity-H can't produce more than 16-bit characters, so I suspect that it is actually spoofing 32-bit ones by triggering a Unicode escape mechanism. Actually, Windows, and I think Linux, don't support storing UTF-32.
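As a small illustration, and assuming the "escape mechanism" meant here is UTF-16 surrogate pairs (that is my reading, not something the file confirms), this Python sketch shows how a character beyond the 16-bit range is stored as two 16-bit units. The helper name is made up.

```python
import struct


def utf16_units(ch):
    """Return the 16-bit code units used to store ch in UTF-16."""
    raw = ch.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


bmp_char = "汉"            # U+6C49, fits in a single 16-bit unit
rare_char = "\U00020000"   # U+20000, CJK Extension B, beyond 16 bits

print([hex(u) for u in utf16_units(bmp_char)])
print([hex(u) for u in utf16_units(rare_char)])
```

The second character comes out as a high surrogate followed by a low surrogate, so a 16-bit code space can still reach it, at the cost of using two codes.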

However, PDF is a final form document format. It was never designed to allow conversion back to revisable form (although it was a design aim that you can cut and paste plain text, subject to permission flags, and authors are supposed to make that possible - but not all authoring tools comply; more recently, accessibility requirements have meant that accessibility tools should be able to get at the plain text).

As there should be a revisable form document that underlies the PDF file, you should ask the PDF file creator for that document. If they have given you a copyright licence to convert the document to HTML, that is the least they can do; they really ought to have set security flags on the document if they didn't want that, but it is not safe to infer a licence from a lack of DRM.

Of course, if it really is a derivative of a J.K. Rowling work, I believe she is very strict on requiring royalties, so I'm surprised that you managed to get that permission legitimately.


Quite possibly - one of the ideas behind the pdf format is that authors can protect content - they're also used for ebooks as the DRM allows documents to be tied to certain computers. Or it could have been done accidentally - at a guess the creator opted to embed the fonts so users wouldn't need to have them installed, with the unintended consequence of making it not-copyable.

There are links to some Harry Potter books in Chinese here, not sure if this one is included.

