Chinese-Forums

ChinesePod PDF-Problem


HerrPetersen


Dajia hao,

I downloaded some PDF files from ChinesePod with a trial account. I found some of the material pretty useful, so I tried to copy out some sentences in order to learn them with my learning program, Anki.

However, when trying to copy the text out, I ended up with nonsense like this:

Hß`Tû` :

w´ ang t ` aitai, wˇ o zˇ enme g¯ en nˇ ı li ´ anx` ı? nˇ ı yˇ ou shˇ ouj¯ ı ma?

Mrs. Wang, how can I get in touch with you? Do you have

a mobile phone?

So basically the hanzi are all messed up and the pinyin is somewhat messed up.

I also have some older PDFs where this works just fine. My question is:

Is this due to the ChinesePod staff trying to protect their copyright more strictly? Or is it due to the PDF being created on a Chinese computer, or some other computer issue?

Are there any solutions (besides copying it all by hand)?

I remember downloading subtitles for a Chinese movie, where the subtitles ended up looking somewhat like the lines above. So my guess (and hope) is that it is some kind of computer issue.

Link to comment
Share on other sites

I create spreadsheets in Excel; when I have gathered enough sentences, I use Anki's import function. Here is some typical material I put in my Excel sheet (this is from a song):

L0001 祝你生日快乐。 zhù nǐ shēng rì kuài lè, [sound:L0001.ogg]

L0002 祝你幸福, 祝你健康。 zhù nǐ xìng fú,zhù nǐ jiàn kāng. [sound:L0002.ogg]

L0003 祝你前途光明。 zhù nǐ qián tú guāng míng。 [sound:L0003.ogg]

L0004 有个温暖家庭。 yǒu gè wēn nuǎn jiā tíng。 [sound:L0004.ogg]

Some fields are left empty here; I use them for definitions and translations.
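For anyone who would rather skip the spreadsheet step, here is a minimal Python sketch of writing the same kind of rows to a tab-separated text, which Anki's text import can read. The function name `write_anki_rows` and the four-column layout (ID, hanzi, pinyin, sound tag) are assumptions for illustration, not the poster's actual setup.

```python
import csv
import io


def write_anki_rows(rows, handle):
    """Write each row as one tab-separated line (one note per line)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(row)


# Two rows in the same shape as the example above: ID, hanzi, pinyin, sound tag.
rows = [
    ("L0001", "祝你生日快乐。", "zhù nǐ shēng rì kuài lè", "[sound:L0001.ogg]"),
    ("L0003", "祝你前途光明。", "zhù nǐ qián tú guāng míng", "[sound:L0003.ogg]"),
]

# An in-memory buffer keeps the sketch self-contained; a real run would
# open a file with encoding="utf-8" instead.
buffer = io.StringIO()
write_anki_rows(rows, buffer)
anki_text = buffer.getvalue()
```

Empty definition or translation fields would simply be extra empty strings in each tuple.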

Link to comment
Share on other sites

There was a thread about this, for another source of PDFs, recently.

Without actually seeing it (and please don't post it), I would say it is almost certainly the result of poorly designed software that embeds a font subset without using the standard encoding, rather than a conspiracy.

PDF provides other ways of protecting intellectual property, so if they really wanted to protect it, you wouldn't be able to cut and paste at all using Acrobat Reader. Being able to cut and paste, where permitted, is one of the design goals of PDF, and it should be possible whenever copying is marked as permitted and the authoring software obeys the rules.

Actually, using a bogus encoding is potentially illegal in some contexts, because it makes the material unavailable to the "assistive technology" used by disabled people, particularly text-to-speech converters. I think the US legislation is very weak in this area, except for federal government projects. The UK legislation is a bit stronger, but not enforced.

Link to comment
Share on other sites

@davidj - the PDF isn't a great standard and the problem here is with automated PDF generation as opposed to manual PDF generation. I'd be very interested if you know of any software that can batch-process PDFs and support copy-and-pasting.

@HerrPetersen - we provide text transcripts of everything at Popup Chinese, so you may find the site a suitable replacement. As long as we're in beta you can get a free premium account by using the voucher number 2008AOYUN. Really depends on your level of difficulty though - ChinesePod excels at producing great Newbie content and has quite sizable archives while we are focusing more on intermediate and advanced students and are quite new (still in beta, in fact).

Link to comment
Share on other sites

Although I have no experience of the Adobe tools, and negligible experience of Chinese PDFs, I would suggest that randomly encoded font subsets are likely to be the result of the software that creates the PostScript file (or prints to pdfwriter), rather than any Adobe tool.

I'd be rather suspicious of any third-party tool that directly generates PDF, except for Ghostscript. At least one just generates a bitmap with no text underlay, which goes completely against the spirit of PDF, but I suspect many violate the spirit in other ways.

Ghostscript requires a PostScript intermediate, which is where I think the problem would arise here. It is a good tool for batch processing, but I have little experience of using it with Chinese.

A non-Chinese, non-PDF example of this problem: with the default font handling settings, some versions of Microsoft Visual Studio print gibberish on one of our PostScript printers. Visual Studio causes its print driver to embed a font set encoded in order of first appearance, but the printer (driver) tries to substitute a similar built-in font.

Link to comment
Share on other sites

I can copy/paste from a PDF created by pdfLaTeX on Ubuntu, but not from one created by MiKTeX's pdfLaTeX, though in both cases there is some weird font embedding going on. The extraction wasn't perfect either, as it decided that it would only paste in ',' rather than ','. At least it works, I guess.

Also, OpenOffice seems to be very good at exporting PDFs. None of which helps an awful lot, but those are my observations.

Link to comment
Share on other sites

The problem, as many people had guessed, is that the software used to generate the PDF files is broken. ChinesePod staff admit as much. They say they "want" to fix it, but they haven't in over a year, so I wouldn't hold my breath. [This has been discussed quite extensively on ChinesePod and the ChinesePod forums.]

The only solution is to use the .html files instead. Those allow valid cut and paste. I think inside each pdf file is a link to the html file. [At least it was last time I checked; but that was a long time ago, as the pdf is useless to me.]

Note that if you prefer traditional, rather than simplified, you can add "trad" before the ".html" or the ".pdf" to get that version.

Link to comment
Share on other sites

  • 1 month later...

From what I understand, the problem might arise from the use of Ghostscript-based utilities. I stumbled upon this thread while investigating why I cannot copy Chinese characters from PDFs converted using Ghostview or printed using CutePDF, both of which are based on Ghostscript.

It looks like Ghostscript by default displays Chinese characters using CID fonts, which use a different encoding from Unicode or any other standard used for Asian characters. So copying this "text" does not make sense in any external application that does not expect it.

It looks like it could be possible to configure Ghostscript to use some freely available Unicode fonts instead, but I've failed to find any instructions on how to do it, just hints that it might be possible.

Does this ring a bell with anybody? Or am I off in my assumptions?

Link to comment
Share on other sites

  • 5 weeks later...

I had the same problem:

- Copying with the free Foxit PDF reader did not grab all the Chinese characters. I solved this by using the haihai PDF reader. You can copy to the clipboard just by selecting the text (no need for Ctrl-C).

- Paste into Word. The Chinese characters come out well. The pinyin is the same mess you mentioned in your post.

- I clean this up with the following macro:

Sub convert_to_unicode()
    ' Replaces each "tone mark, space, vowel" sequence left by the PDF copy
    ' with the proper pinyin letter.
    ' ChrW(711) is the caron; ChrW(305) is the dotless i.

    ' The two-part sequences for u-umlaut run first, so the plain "u" rules
    ' further down cannot match part of them.
    ReplaceEverywhere "¨ u ¯ u", ChrW(470)
    ReplaceEverywhere "¨ u ´ u", ChrW(472)
    ReplaceEverywhere "¨ u " & ChrW(711) & " u", ChrW(474)
    ReplaceEverywhere "¨ u ` u", ChrW(476)

    ReplaceEverywhere "¯ a", ChrW(257)
    ReplaceEverywhere "´ a", "á"
    ReplaceEverywhere ChrW(711) & " a", ChrW(462)
    ReplaceEverywhere "` a", "à"

    ReplaceEverywhere "¯ e", ChrW(275)
    ReplaceEverywhere "´ e", "é"
    ReplaceEverywhere ChrW(711) & " e", ChrW(283)
    ReplaceEverywhere "` e", "è"

    ReplaceEverywhere "¯ " & ChrW(305), ChrW(299)
    ReplaceEverywhere "´ " & ChrW(305), ChrW(237)
    ' Third-tone i is ChrW(464); ChrW(299) is the first-tone i.
    ReplaceEverywhere ChrW(711) & " " & ChrW(305), ChrW(464)
    ReplaceEverywhere "` " & ChrW(305), "ì"

    ReplaceEverywhere "¯ o", ChrW(333)
    ReplaceEverywhere "´ o", "ó"
    ReplaceEverywhere ChrW(711) & " o", ChrW(466)
    ReplaceEverywhere "` o", "ò"

    ReplaceEverywhere "¯ u", ChrW(363)
    ReplaceEverywhere "´ u", "ú"
    ReplaceEverywhere ChrW(711) & " u", ChrW(468)
    ReplaceEverywhere "` u", "ù"

    ' Sequences where the u-umlaut arrives as a single character.
    ReplaceEverywhere "¯ ü", ChrW(470)
    ReplaceEverywhere "´ ü", ChrW(472)
    ReplaceEverywhere ChrW(711) & " ü", ChrW(474)
    ReplaceEverywhere "` ü", ChrW(476)
End Sub

' Runs one literal find-and-replace over the whole document.
Private Sub ReplaceEverywhere(ByVal findText As String, ByVal replaceText As String)
    With Selection.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = findText
        .Replacement.Text = replaceText
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchByte = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
        .Execute Replace:=wdReplaceAll
    End With
End Sub

Not the most elegant way to do this, but it solves the problem.

Link to comment
Share on other sites

Good job, finding a way to solve the pinyin-issue! :-)

Unfortunately I am using OpenOffice right now, so I can't use it until maybe later :(

I have three sets of PDFs from ChinesePod:

- one is pretty old (a couple of years)
- one is kind of old (a year or so)
- one is new (some weeks/months)

- In the first set, characters copy out perfectly.
- In the second set, characters do not show as selected, but when pasting everything is fine.
- In the third set, characters are just messed up.

So I guess your ChinesePod PDFs are from the second set.

It looks to me as if ChinesePod is just trying to make it harder to copy their stuff. (and who would blame them - they are doing a great job and deserve some money)

Link to comment
Share on other sites

I use the same macro on the whole range of PDFs and so far it works. I download new lessons as soon as they are available and often use this to save bits and pieces.

I have seen pinyin PDFs from other sites that look the same, so it must have something to do with the encoding. I did not investigate whether that can be changed.

You should be able to write an easy macro for yourself in OpenOffice. All it is is searching and replacing.
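Since it really is just searching and replacing, the same idea can be sketched outside any office suite. The Python below is a minimal illustration, not a port of the Word macro: the `REPLACEMENTS` table holds only a handful of pairs, and the name `fix_pinyin` is made up for the example. The sample string is a fragment of the garbled line from the first post.

```python
# (garbled sequence as pasted from the PDF, intended pinyin letter)
REPLACEMENTS = [
    ("\u00af e", "\u0113"),       # macron, space, e         -> e with macron
    ("\u02c7 e", "\u011b"),       # caron, space, e          -> e with caron
    ("\u02c7 o", "\u01d2"),       # caron, space, o          -> o with caron
    ("\u00af \u0131", "\u012b"),  # macron, space, dotless i -> i with macron
    ("\u02c7 \u0131", "\u01d0"),  # caron, space, dotless i  -> i with caron
]


def fix_pinyin(text):
    """Apply every literal replacement in table order."""
    for garbled, fixed in REPLACEMENTS:
        text = text.replace(garbled, fixed)
    return text


# Fragment of the garbled sample from the first post.
sample = "w\u02c7 o z\u02c7 enme g\u00af en n\u02c7 \u0131"
fixed_sample = fix_pinyin(sample)
```

A full table would need one pair per tone and vowel, with any multi-part ü sequences listed before the plain "u" pairs.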

Link to comment
Share on other sites

It looks to me as if ChinesePod is just trying to make it harder to copy their stuff. (and who would blame them - they are doing a great job and deserve some money)

Given that they also provide the .html versions, which allow full cut and paste, I think it's just incompetence on their part.

Link to comment
Share on other sites

  • 1 year later...

Sorry for the necropost.

I finally got around to this problem again and tried to install the script dieterlu wrote. However, when creating a macro I get:

Runtime error (450)

Wrong number of arguments

and when starting the debugger it skips to:

Selection.Find.ClearFormatting

I do not have any experience with macros in Excel, so could someone please give an explanation for dummies?

Link to comment
Share on other sites

Copy and paste of the text in CPod's PDFs doesn't work, but at the bottom of the page in each PDF there are links to the HTML versions.

You may also change the download link

from: http://s3.amazonaws.com/chinesepod.com/xxxx/yyyyyyyyyy/pdf/chinesepod_Zxxxx.pdf

to: http://s3.amazonaws.com/chinesepod.com/xxxx/yyyyyyyyyy/pdf/chinesepod_Zxxxx.html

(The above links won't work.)
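If you have a list of such links, the change of ending can be scripted. This is only a sketch under the assumption that the extension is the single difference between the two links, as in the placeholder example above; `html_url_for` is a hypothetical helper name.

```python
def html_url_for(pdf_url):
    """Return the same URL with a trailing '.pdf' swapped for '.html'."""
    if not pdf_url.endswith(".pdf"):
        raise ValueError("expected a URL ending in .pdf")
    return pdf_url[: -len(".pdf")] + ".html"


# The placeholder link from the post above (it does not resolve).
example = "http://s3.amazonaws.com/chinesepod.com/xxxx/yyyyyyyyyy/pdf/chinesepod_Zxxxx.pdf"
html_example = html_url_for(example)
```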

HTH

Link to comment
Share on other sites

Copy/paste does work if you use the Sumatra PDF reader. (The haihai reader doesn't seem to do it any more.)

Also, a simpler version, an Excel macro that repairs the messed-up pinyin, goes like this:

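' "ZeichenTauschen" is German for "swap characters"; "Zelle" means "cell".
Sub ZeichenTauschen()
    Dim Zelle As Range
    ' Walk down column A of the sheet "Tauschliste" ("swap list").
    For Each Zelle In ThisWorkbook.Sheets("Tauschliste").Cells(1, 1).CurrentRegion.Columns(1).Cells
        ' In the selected cells, replace the text from column A with the text from column B.
        Selection.Replace Zelle.Value, Zelle.Offset(0, 1).Value, lookat:=xlPart, MatchCase:=True
    Next Zelle
End Sub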

You will have to create a worksheet called "Tauschliste" with the following content starting in cell A1 (search text in column A, replacement in column B):

A       B
ˇ e     ě
` e     è
¯ ı     ī
´ ı     í
ˇ ı     ǐ
` ı     ì
¯ o     ō
´ o     ó
ˇ o     ǒ
` o     ò
¯ u     ū
´ u     ú
ˇ u     ǔ
` u     ù
¯ ü     ǖ
´ ü     ǘ
ˇ ü     ǚ
` ü     ǜ
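A table like this does not have to be typed out by hand. As a hedged alternative (not what the macro above does), the Python sketch below turns each spacing tone mark into its combining counterpart, puts it after the vowel, and lets Unicode NFC normalisation build the composed letter. The name `compose_tone_marks` is made up for the example, and it assumes the garbled text always has the form "mark, one space, vowel".

```python
import re
import unicodedata

# Spacing tone marks as they appear in the pasted text -> combining marks.
COMBINING = {
    "\u00af": "\u0304",  # macron
    "\u00b4": "\u0301",  # acute
    "\u02c7": "\u030c",  # caron
    "\u0060": "\u0300",  # grave (backtick)
}

# One tone mark, one space, then a vowel (including u-umlaut and dotless i).
PATTERN = re.compile("([\u00af\u00b4\u02c7\u0060]) ([aeiou\u00fc\u0131])")


def _compose(match):
    mark, vowel = match.group(1), match.group(2)
    if vowel == "\u0131":
        vowel = "i"  # the dotless i only appears because the dot was dropped
    # NFC merges the vowel and the combining mark into one composed letter.
    return unicodedata.normalize("NFC", vowel + COMBINING[mark])


def compose_tone_marks(text):
    """Repair every 'mark, space, vowel' sequence in the text."""
    return PATTERN.sub(_compose, text)


sample = "sh\u02c7 ouj\u00af \u0131"
repaired = compose_tone_marks(sample)
```

One caveat of this approach: an ordinary backtick followed by a space and a vowel would also be rewritten, so it is best applied to the pinyin lines only.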

Link to comment
Share on other sites

  • 5 weeks later...

I know this is a very old topic, but the problem still seems to exist: copying and pasting from ChinesePod PDFs does not work properly.

A very simple solution is to reprint the Chinesepod pdf to pdf using a pdf generation program.

I use the free version of pdf995, it works fine.

The output looks good and I can copy and paste normally.

Link to comment
Share on other sites
