Chinese-Forums

BLCUP e-book .opz format


markhavemann


I'll post this here in case somebody else comes across the same problem.

 

I bought an e-book at https://www.blcup.com/. Somehow the e-book version was 100 RMB while the physical book only costs 30 or so on Taobao. But oh well, it's worth it if I can just carry a tablet to class instead of a bunch of heavy books.

 

After paying for the book I was really annoyed to find out it's in some weird .opz format, and you need to download their own really crappy reader to open it. I also couldn't find anything on the internet about the .opz format or how to convert it to a better format.

 

Anyway, here's what I figured out: 

 

  1. Rename the file so it ends in .pdf
  2. Open it with PDF-XChange Editor
  3. It will open, but it will say there are errors and ask if you want to save a new, fixed file.
  4. Save it as a fresh PDF that can be read in any application
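If you have several files to do, step 1 can be scripted. This is only a rough Python sketch: the function name `copy_opz_as_pdf` is made up for illustration, and the check for a `%PDF-` marker is an assumption about what the .opz file contains (all I know for sure is that PDF-XChange Editor could open and repair mine). It copies instead of renaming so the original stays untouched, and the repair in steps 2 to 4 still has to be done in the editor.

```python
import shutil
from pathlib import Path


def copy_opz_as_pdf(opz_path):
    """Copy e.g. book.opz to book.pdf and report whether it looks like a PDF.

    Returns (new_path, has_pdf_header). The original file is left untouched.
    """
    src = Path(opz_path)
    dst = src.with_suffix(".pdf")
    shutil.copyfile(src, dst)

    # Ordinary PDFs carry a "%PDF-" marker near the start of the file.
    # Whether an .opz file does too is an assumption, so this only reports it.
    with dst.open("rb") as fh:
        has_pdf_header = b"%PDF-" in fh.read(1024)
    return dst, has_pdf_header
```

If `has_pdf_header` comes back `False`, the file may still open in a forgiving editor, so treat the flag as a hint and not as a verdict.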

 

Unfortunately the text seems to have some weird encoding issues, so copying it into another application just results in garbage (not great for quickly looking up characters). I'm trying to figure this out and I'll post the solution if I do.

 


As far as the copy-paste thing goes, you can probably use Pleco to OCR the parts you want to copy, since you're using it on a tablet or phone.


1 hour ago, thelearninglearner said:

As far as the copy-paste thing goes, you can probably use Pleco to OCR the parts you want to copy, since you're using it on a tablet or phone.

Yeah looks like I will have to resort to that. Not as convenient as copying and pasting, but I guess it will do.


3 hours ago, markhavemann said:

Yeah looks like I will have to resort to that. Not as convenient as copying and pasting, but I guess it will do.

Maybe also check out some advanced PDF readers (I'm thinking Adobe Reader). They might have some features that can help, like extra conversion options.


13 hours ago, thelearninglearner said:

Maybe also check out some advanced PDF readers (I'm thinking Adobe Reader). They might have some features that can help, like extra conversion options.

I've never had so many PDF editors installed on my computer at once. Eventually I found a tool to look at the "Unicode mapping" tables of the PDF. It looks like the character shapes were saved as vector "glyphs" so that they could be displayed, and a text character is linked to each one, but when the PDF was created it didn't specify WHICH Unicode character was linked to which glyph, meaning the text is completely unrecoverable without identifying each character manually.
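For anyone who wants a quick check without installing yet another editor, here is a rough Python sketch that counts `/ToUnicode` entries, which is the key a PDF font uses to point at its glyph-to-Unicode table. The function name `count_tounicode_refs` is made up, and the approach is crude: it only looks at the raw bytes plus any streams that happen to inflate with zlib, so it can miss entries stored under other filters or in encrypted files. A count of zero fits what I saw, but a non-zero count does not prove the tables are correct.

```python
import re
import zlib

# Matches the body of each "stream ... endstream" section in the file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def count_tounicode_refs(pdf_bytes):
    """Count '/ToUnicode' keys in the raw file and in zlib-inflatable streams."""
    count = pdf_bytes.count(b"/ToUnicode")
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates a stream whose last byte was trimmed.
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data, or a filter this sketch ignores
        count += inflated.count(b"/ToUnicode")
    return count
```

Usage would be something like `count_tounicode_refs(open("book.pdf", "rb").read())`, where `book.pdf` is a placeholder for the repaired file.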

 

5 hours ago, 大块头 said:

ocrmypdf may be a solution

I eventually settled on PDF-XChange's built-in OCR, which seems to work much better than Adobe's for some reason. It also has an option to OCR existing "text", which saved me from having to flatten each page into an image or anything like that.
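I didn't go the ocrmypdf route myself, but for anyone who wants to try it, this is roughly how I'd expect it to be called from Python. Treat it as an untested sketch: `--force-ocr` tells ocrmypdf to rasterize and OCR pages even if they already contain text (the closest match to "OCR existing text"), `chi_sim` is the Tesseract language code for simplified Chinese and needs that language pack installed, and the file names and helper function names are placeholders.

```python
import shutil
import subprocess


def build_ocrmypdf_command(src_pdf, dst_pdf, language="chi_sim"):
    """Build the ocrmypdf command line without running it."""
    return ["ocrmypdf", "--force-ocr", "--language", language, src_pdf, dst_pdf]


def run_ocr(src_pdf, dst_pdf, language="chi_sim"):
    """Run ocrmypdf if it is installed; raise otherwise."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocrmypdf_command(src_pdf, dst_pdf, language), check=True)


# Example (placeholder file names):
# run_ocr("book.pdf", "book_ocr.pdf")
```

Building the command separately from running it makes it easy to print and inspect before committing to a long OCR run on a whole textbook.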


Another option (for next time!?) would be to buy the book and manually scan it in. It sounds tedious but it’s not that bad now that phone scanners are decent. I scanned in a 300 page textbook myself and it took less than an hour. You could also do a chapter at a time if you wanted to (probably takes less than 5 minutes). That hour included taking the scans and tweaking a few of them. It’s not as perfect as it would have been if there was a pdf actually available (it’s an old book) but for personal use on my iPad it’s great. At least it is a pdf file that can be opened by any standard reader. 


I'm guessing the OPZ format used some kind of character map and that's why you get garbage out. To get it to be proper text data, you would need to know the mapping of the codepoints, which might be relatively easy if they've just shifted them over by 1000 or something, just to make copy-pasting not work, but if it's a full remapping that might be harder. Given the size of the book, my guess is that they haven't done a complete remapping as that would cut down on the size dramatically (assuming they didn't do a random shuffle just to prevent copy-paste).

 

If they have just shifted them over, you could reverse engineer it by just looking at the glyph, finding the relevant Unicode codepoint, and calculating the difference from the codepoint that sits under the glyph in the file. Check against a few characters to be sure, and if two or three map the same way, probably that's what they've done.


8 hours ago, NinKenDo said:

 

If they have just shifted them over, you could reverse engineer it by just looking at the glyph, finding the relevant Unicode codepoint, and calculating the difference from the codepoint that sits under the glyph in the file. Check against a few characters to be sure, and if two or three map the same way, probably that's what they've done.

That's a good point. I've noticed that copying "的人" pretty consistently gives ".¶", while 的 alone is "3" and 人 alone is "+", so it does look like there is method to the madness, but it's slightly beyond my own expertise unfortunately.
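The offset check suggested above is easy to script. Here is a small Python sketch using only the two single-character observations from this post (的 copies as "3", 人 copies as "+"); the function names are made up. It subtracts the copied codepoint from the real one and reports whether every pair agrees on one shift. It does not try to explain why the pair "的人" copies differently from the two characters on their own.

```python
def offsets(pairs):
    """Map each real character to (real codepoint - copied codepoint)."""
    return {real: ord(real) - ord(copied) for real, copied in pairs}


def constant_shift(pairs):
    """Return the shared offset if every pair agrees, otherwise None."""
    distinct = set(offsets(pairs).values())
    return distinct.pop() if len(distinct) == 1 else None


# Single-character observations reported in this thread: (real, copied).
observed = [("的", "3"), ("人", "+")]

print(offsets(observed))         # {'的': 30289, '人': 20111}
print(constant_shift(observed))  # None
```

The two offsets differ, so at least for these samples a plain "shift everything by N" scheme doesn't fit. More samples would be needed to say anything about what the mapping actually is.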

 

I've uploaded the PDF here if you or anyone else wants to try to crack the code.
