Chinese-forums.com

Jihanzi: vocabulary extraction and content recommendation

jannesan

Hi everyone,
I would like to introduce you to my little project that some people may find useful.
It is a free website (www.jihanzi.com) with 2 functions:

 

1. Extracting vocabulary (simplified/traditional) from epub (DRM free), pdf or text files

 

You can also match the text against your known words, if you have a list at hand, to find all the unknown words in it. The output also shows how often each word occurs and where it first occurs (chapter for epub, page for pdf, relative position for plaintext). For convenience I added a filter for a minimum number of occurrences.
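
For readers curious how this kind of extraction could work, here is a minimal Python sketch. It is not Jihanzi's actual code: the function name `extract_vocabulary` and its parameters are made up for illustration, and it assumes the text has already been split into words (real Chinese text needs a word segmenter first, which is left out here).

```python
from collections import Counter


def extract_vocabulary(chapters, known_words=frozenset(), min_occurrences=1):
    """Return (word, count, first_chapter) rows for unknown words.

    chapters: list of chapters, each a list of already segmented words
    (hypothetical input format, chosen only for this sketch).
    """
    counts = Counter()
    first_seen = {}
    for chapter_no, words in enumerate(chapters, start=1):
        for word in words:
            counts[word] += 1
            # remember only the first chapter in which the word appears
            first_seen.setdefault(word, chapter_no)

    rows = [
        (word, n, first_seen[word])
        for word, n in counts.items()
        if word not in known_words and n >= min_occurrences
    ]
    # most frequent words first, ties ordered by first occurrence
    rows.sort(key=lambda row: (-row[1], row[2]))
    return rows


chapters = [["我", "喜欢", "看", "书"], ["我", "喜欢", "电影"]]
result = extract_vocabulary(chapters, known_words={"我"}, min_occurrences=2)
print(result)
```

With the toy input above, only 喜欢 is both unknown and frequent enough to be listed.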

 

2. Recommendation of books based on your known words

 

If you upload your known words, you can again filter for a minimum number of occurrences and download a list of the books ordered from fewest to most unknown words.
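
The ranking idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not how the site is implemented: `rank_books` is a hypothetical helper, and each book is assumed to be available as a word-to-count mapping.

```python
def rank_books(books, known_words, min_occurrences=1):
    """Order books from fewest to most unknown words.

    books: dict mapping a title to a dict of word -> occurrence count
    (hypothetical data layout used only in this sketch).
    """
    ranking = []
    for title, counts in books.items():
        unknown = [
            word
            for word, n in counts.items()
            if word not in known_words and n >= min_occurrences
        ]
        ranking.append((title, len(unknown)))
    # fewest unknown words first
    ranking.sort(key=lambda item: item[1])
    return ranking


books = {
    "Book A": {"我": 10, "喜欢": 3, "电影": 1},
    "Book B": {"我": 5, "书": 2},
}
ranking = rank_books(books, known_words={"我", "书"})
print(ranking)
```

Raising `min_occurrences` ignores rare words, so a book with many one-off unknown words can move up the list.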

For now this only matches against 155 books, all of which are recommendations from this popular thread: https://www.chinese-forums.com/forums/topic/2034-what-are-you-reading/ . I actually found around 350 titles that were mentioned there and exist on douban, but I could only find text files for 155 of them.
I am planning to add more books and also movies.


If you have any suggestions, please let me know.

Cheers

 


Jan Finster

This sounds like a useful tool! :)

 

(Edited) At first I did not get any readable characters, but this seems to be due to a font issue in my Excel. After uploading the .csv to Google Docs, everything looks fine. Thanks! :)

 

DavyJonesLocker

Feedback

 

I just tried it and uploaded an epub file, but a message saying "please select a source file with a valid filetype" appeared. The epub file is fine in several epub readers though.

 

When I copy a page of text from the file and just paste it into a Notepad file (Unicode format), I still get the same message.

Chrome, Microsoft Edge (VPN on, off): same response.

 

However, saved as UTF-8 it seems fine, but when I download the CSV file it comes out as gobbledygook. I changed the extension to .txt and it opens fine in Notepad, Word, etc.

 

 

mungouk

There seems to be an issue with opening .csv files in Excel if they have Hanzi in them — it doesn't recognise the unicode.  (At least that's the case on Excel for Mac 16.32)

 

If you're using a Mac, the Numbers app manages it no problem. 

jannesan

The CSV file is UTF-8 encoded; I guess Excel opens it with some other encoding.

I have no experience with Excel, but it should be possible to specify a different (than default) encoding when importing a file.
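
A common workaround on the producing side is to write the CSV with a UTF-8 byte order mark, which Excel generally uses as a hint to read the file as UTF-8. The Python sketch below shows the idea; it is an assumption about how one could do it, not a description of what Jihanzi does, and `write_csv_for_excel` is a made-up name.

```python
import csv
import os
import tempfile


def write_csv_for_excel(path, rows):
    # "utf-8-sig" writes the bytes EF BB BF (the UTF-8 BOM) before the data
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "vocabulary.csv")
    write_csv_for_excel(path, [["word", "count"], ["喜欢", 2]])
    with open(path, "rb") as f:
        raw = f.read()

print(raw[:3])
```

Tools that do not expect a BOM may show it as a stray character at the start of the first cell, so this is a trade-off rather than a universal fix.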

 

9 hours ago, DavyJonesLocker said:

I just tried it and uploaded an epub file, but a message saying "please select a source file with a valid filetype" appeared. The epub file is fine in several epub readers though.

 

Is the epub DRM protected? I suspect this may be the reason, I can only extract text from DRM free epubs. I should add this to the description.

 

I am using https://calibre-ebook.com/ to manage my ebooks and convert between different formats. You can install a plugin (https://apprenticealf.wordpress.com/2012/09/10/calibre-plugins-the-simplest-option-for-removing-most-ebook-drm/) to get rid of DRM and thus really own your ebooks.

 

9 hours ago, DavyJonesLocker said:

When I copy a page of text from the file and just paste it into a Notepad file (Unicode format), I still get the same message.

Chrome, Microsoft Edge (VPN on, off): same response.

 

You mean when you upload as plaintext? This is strange.

DavyJonesLocker
1 hour ago, jannesan said:

Is the epub DRM protected? I suspect this may be the reason, I can only extract text from DRM free epubs. I should add this to the description.

 

 

No idea, I just downloaded it off Baidu and opened it in a free epub reader, and it seems fine. I can send it to you if you like; it's a small file.

 

1 hour ago, jannesan said:

You mean when you upload as plaintext? This is strange.

 

 

Yup, I just called it file.txt, then saved the file again as UTF-8, and it works fine.
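
A likely explanation, offered as a guess: Notepad's "Unicode" option saves UTF-16 with a byte order mark, so a server that expects UTF-8 cannot read the upload. Below is a minimal Python sketch of how an upload handler could accept both by checking the BOM. The function name `decode_text_upload` is hypothetical, and UTF-32 is deliberately ignored to keep the example short.

```python
import codecs


def decode_text_upload(raw: bytes) -> str:
    """Decode uploaded plaintext as UTF-8 or BOM-marked UTF-16."""
    if raw.startswith(codecs.BOM_UTF8):
        # UTF-8 with BOM: drop the three marker bytes, then decode
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # the "utf-16" codec reads the BOM to pick the byte order and strips it
        return raw.decode("utf-16")
    # no BOM: assume plain UTF-8
    return raw.decode("utf-8")


utf16_upload = "你好".encode("utf-16")  # BOM plus UTF-16 data
utf8_upload = "你好".encode("utf-8")
print(decode_text_upload(utf16_upload), decode_text_upload(utf8_upload))
```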

 

I'd say it's worth adding more information on error codes and general usage to your website, e.g. that you can't use Kindle books etc. (I presume).

jannesan
54 minutes ago, DavyJonesLocker said:

I can send it to you if you like, small file 

 

Yes, please!

I am using a very common library to check for filetypes, but maybe it doesn't work as well as I thought. I can change it to accept all files with .epub extension and then try to read it on the server.
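
One way to "try to read it on the server" could look like the Python sketch below. It relies on the fact that an EPUB is a ZIP archive containing a `mimetype` entry with the value `application/epub+zip`. The helper name `looks_like_epub` is made up, and this is only an assumption about a possible check, not the site's code.

```python
import io
import zipfile


def looks_like_epub(data: bytes) -> bool:
    """Return True if the bytes are a ZIP archive with an EPUB mimetype entry."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.read("mimetype").strip() == b"application/epub+zip"
    except (zipfile.BadZipFile, KeyError):
        # not a ZIP at all, or a ZIP without a "mimetype" entry
        return False


# Build a tiny stand-in archive in memory for the demo (not a complete EPUB).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
sample = buffer.getvalue()

print(looks_like_epub(sample), looks_like_epub(b"plain text"))
```

A check like this looks at the content rather than the browser-reported filetype, so it would not depend on what the client claims the file is.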

 

56 minutes ago, DavyJonesLocker said:

I'd say it's worth adding more information on error codes and general usage to your website, e.g. that you can't use Kindle books etc. (I presume).

 

Yes, I will do that. It is easy to convert to epub with Calibre, but yea that's an extra step to take as a user.

 

Thanks for the feedback:)

dougwar

Cool project! Can you add HSK level word lists that can be selected as known?

- HSK1 (if I know all words from this level)

- HSK2

 

Etc...
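
In code, the suggestion amounts to treating every word up to the chosen level as known. A small Python sketch, using toy word lists rather than the official HSK lists and a hypothetical helper name `known_words_up_to`:

```python
def known_words_up_to(hsk_lists, level):
    """Union of all word lists from level 1 up to and including `level`."""
    known = set()
    for lvl, words in hsk_lists.items():
        if lvl <= level:
            known.update(words)
    return known


# Toy data for illustration only, not the official HSK vocabulary.
hsk_lists = {1: {"我", "你", "好"}, 2: {"因为", "所以"}, 3: {"环境"}}
known = known_words_up_to(hsk_lists, 2)
print(sorted(known))
```

The resulting set could then be used in place of an uploaded known-words file.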

jannesan
10 hours ago, dougwar said:

Cool project! Can you add HSK level word lists that can be selected as known?

 

Yes, good idea:)

I'll let you know when I have added this.

DavyJonesLocker

Hey @jannesan, I tried another file on your website and it comes up with "bad requests".

It's just a plain Unicode text file; it opens fine in MS Notepad.

 

Actually, I'll PM you the file.

jannesan

@DavyJonesLocker

Thanks for letting me know, I'll look into it and PM you when it's resolved:)
