Chinese-Forums

Introducing Chinese Text Analyser


imron


In keeping with Chinese Forums' giveaway month, Roddy has kindly offered to sponsor 10 licences for Chinese Text Analyser.  To get one, download the program and from the main menu, select:  Help -> Send Feedback and send me a message stating why you'd like a licence.

 

Also include your name, email and physical address.  I never use your physical address, I just add it to the licence to discourage people from sharing them :-)

 

(Note: consideration will only be given to requests sent through the Send Feedback dialog of the program).

 

Also mention your Chinese Forums username, because as with the book giveaway, you'd be expected to write a review, so if more than 10 people ask, consideration will be given to how long you've been a member, the posts you've made and so on.

 

Update: These are now all gone.


Imron was kind enough to give me a license in exchange for a review.

 

I wasn't sure what to expect, or how useful it would be to me, but I dusted off my ancient Tablet PC(!) with Windows Vista(!!) and installed it. 

 

I'm impressed.  The program runs quickly, even on a machine on which everything else is pretty slow.  It's very handy to be able to quickly go from a bunch of text to the list of words you need to learn to understand that text.  You can then export the list of those words (and the list of words you know in the text, if you wish).

 

I understand the program is still in development; I noticed a few rough edges:

 

- At least on my ancient PC, the display of the Chinese characters was pretty rough in the program, a bit rougher than shown here.  It was OK to use the program to segment text, but as it stands I could not use the program to read a document.  I tried a couple different fonts and increasing the font size.

 

- There needs to be a way to manually change segmentation.  Hard to emphasize enough how important a feature this is.

 

- I would really like to see a way to export to Pleco flashcard format.

 

So in summary, I think right now it's a very fast, useful utility.  Give it a try; it won't take long to see what it can do. :)


I'm impressed.  The program runs quickly, even on a machine on which everything else is pretty slow.

High performance has been one of the core design goals from the very beginning, so it's good to hear it holds up even on older machines.

 

I would really like to see a way to export to Pleco flashcard format.

It exports to a tab-separated file, which is supported by Pleco.  See here.  Just make sure you export the fields that Pleco needs.
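For anyone unsure what a tab-separated word list looks like, here is a minimal sketch in Python. The sample rows and the column order (word, pinyin, definition) are assumptions for illustration only; check Pleco's import settings for the fields it actually expects.

```python
import csv
import io

# Hypothetical sample rows: (word, pinyin, definition).
rows = [
    ("你好", "ni3 hao3", "hello"),
    ("谢谢", "xie4 xie5", "thanks"),
]

def to_tsv(rows):
    """Return the rows as tab-separated text, one entry per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

tsv_text = to_tsv(rows)
print(tsv_text)
```

Each line holds one word, with the fields separated by a single tab character.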

 

the display of the Chinese characters was pretty rough in the program,

Can you please post a screen shot so I can see what you mean?  Can you also maybe find an example (e.g. using another program) of what you consider to be acceptable so I can try and figure out a way to bring it up to that level.

 

There needs to be a way to manually change segmentation.  Hard to emphasize enough how important a feature this is.

In the future I plan to support a range of different segmenters as well as manual segmentation.  Probably not until the 1.0.0 release though.


 

Can you please post a screen shot so I can see what you mean?  Can you also maybe find an example (e.g. using another program) of what you consider to be acceptable so I can try and figure out a way to bring it up to that level.

I've tried to attach examples.  I realize Word is probably a high standard to meet, but perhaps something can be done.

 

 

It exports to a tab-separated file, which is supported by Pleco.  See here.  Just make sure you export the fields that Pleco needs.

Sorry I missed that.  You might want to consider adding named export options with preset fields.

 

 

post-3924-0-34831600-1399543318_thumb.png

post-3924-0-49064500-1399543329_thumb.png


I realize Word is probably a high standard to meet, but perhaps something can be done.

I have quite high standards :-)  So I'll try to see what I can figure out, because the text looks really bad there - it doesn't look anywhere near that bad on my machine.  What fonts are you using in both Word and in Chinese Text Analyser?

 

You might want to consider adding named export options with preset fields.

Yes, I plan to do something like that eventually, it's just a matter of trying to figure out how to fit it in nicely with the dialog.


It sounds like it might be a perfect tool for me! I'm a self learner who loves to learn by reading, but since I'm still at the beginning of my Chinese adventure, it's really difficult to find texts to read at an appropriate level (other than in textbooks). I'm definitely going to download it and check it out and maybe get a free license if they're still available :D


I'm a little conflicted.

 

I wish I'd had the tool 10-15 years ago. Then again, there wasn't that much Chinese text available online then, nor were computers powerful enough to have the performance required.

 

Still, I'm at the point (whether skill-wise or free-time-wise) that every hour spent using study tools is an hour I'm not reading novels or watching television serials.  I have about 10 novels and at least 30 television serials stacked up (including 3-4 I want to re-watch) and waiting, so I really need to prioritize my time.

 

On the other hand, I always want to improve my Chinese, and I have discovered that my current tendency to recognize characters/pronunciation by context rather than actual composition is both lazy and dangerous (potentially embarrassing, at least).  Not to mention, there are now at least a dozen characters that I have learned the meaning of via reading that I have no idea what the pronunciation is (I often read away from any electronic device, resulting in it being inconvenient to look up characters merely for pronunciation), and I expect that to grow as I continue reading.

 

So it is quite possible this tool is perfect for me.

 

What other tools are required to maximize the usefulness of your Text Analyser?  Do I need Pleco?  Anki? Skritter?

I assume that these other tools aren't required, but I want to maximize the utility and success of using it.

 

Or to put it another way: what is the best suite of applications to use with the Text Analyser?

What is the cost for full versions of the best combination of supplemental applications?


Hi. I'm an upper-intermediate self-study student. I'm a heritage learner, so my spoken Chinese is much better than my reading. I think this tool would be a great asset in helping me acquire better reading skills. I anticipate using it with Pleco.

 

I downloaded your trial and requested participation in your license giveaway as a technologically illiterate guinea pig! I'm trying to get started, but I'm actually confused on the first step of importing lists of known words. This is technically a Pleco question, not a question directed towards your text analyzer, imron, but do you know how to only export a list of words from Pleco that are "known"? In Pleco I define "known" as having a score above a certain value.

 

Like I said, I'm not computer savvy, but I am willing to write a review from the stupid user's point of view, lol. 


 

What fonts are you using in both Word and in Chinese Text Analyser?

Good news: both were using SimSun, but CTA was 12 point, while Word was 10 point.  When I changed CTA to 10 point, the characters looked fine.  So the issue is much less severe than it seemed last night, when I tried different fonts and increasing the font size in CTA.


Never mind my previous post. I figured out how to export the known cards from Pleco. I made a new category, performed a search with a score restriction (>1000), and put all the results in that category. Then I exported just that category as a word list. And I successfully imported my known words into the text analyzer!


Still, I'm at the point (whether skill-wise or free-time-wise) that every hour spent using study tools is an hour I'm not reading novels or watching television serials.

 

Chinese Text Analyser is still useful even if you only intend to read paper books and watch TV series.  The reason being that you don't actually need to use Chinese Text Analyser to read the texts.  CTA is used to identify unknown words in texts, so simply find the electronic version of the book, or subtitles of the TV series, load it in, find the top X number of words that you don't know sorted by frequency and first occurrence, and then prelearn those words so that you already know them when you come across them in the book.  Repeat as often as necessary - say with the top 10 unknown words daily.  This works especially well if you have an electronic version split in to chapters as you can just load a chapter at a time and prioritise that vocabulary.  You could end up spending less than a minute a day in CTA if all you did was open the current chapter and export the next 10 unknown words (though obviously you'd then still need to spend time to learn those words).
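The "top 10 unknown words daily" routine described above can be sketched roughly as follows. This is not how CTA works internally, just an illustration in Python that assumes the chapter has already been segmented into a list of words; the function name `daily_batches` and the sample data are made up.

```python
from collections import Counter

def daily_batches(chapter_words, known, batch_size=10):
    """Yield successive batches of unknown words, most frequent first."""
    counts = Counter(w for w in chapter_words if w not in known)
    ranked = [word for word, _ in counts.most_common()]
    for start in range(0, len(ranked), batch_size):
        yield ranked[start:start + batch_size]

# Hypothetical known-word list and pre-segmented chapter.
known = {"我", "是", "的"}
chapter = "我 是 学生 的 老师 学生 朋友 老师 学生".split()

batches = list(daily_batches(chapter, known, batch_size=2))
print(batches)
```

Each batch is one day's worth of words to prelearn before reading the chapter.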

 

What other tools are required to maximize the usefulness of your Text Analyser?  Do I need Pleco?  Anki? Skritter?

I assume that these other tools aren't required, but I want to maximize the utility and success of using it.

 

Chinese Text Analyser helps you identify words in a piece of text that you don't know.  It doesn't provide any way to study those words separately.  As such, to get the most amount of benefit from it you should have some way of studying the lists of words that you export from the program.

 

This can be Anki or Pleco, or it can be an Excel spreadsheet, or you can print out the lists and manually look them up in a dictionary.  CTA just helps you prioritise lists of words to learn based on context.  However you normally learn words, you should then use the exported lists to learn them.  You can export a large variety of different fields depending on your needs, including the word itself, pinyin, English definitions, sentences with the words, cloze deletion sentences containing the words and so on.

 

If you don't have Pleco, I would recommend getting it regardless of whether you purchase CTA or not.  At a minimum, you should purchase the flashcard module and because you are at a level where you can read Chinese novels, also the Xiandai Hanyu Guifan Cidian Chinese-Chinese dictionary.  Again, this is regardless of whether you purchase CTA or not.  I'd also recommend the ABC C-E English dictionary as a backup because it has a larger number of words than Guifan.

 

From Pleco's site, Flashcards + Guifan will cost ~USD 30.  ABC will be USD 20 on top of that.  It's well worth exploring if there are any other options you might like.  Personally, I have the professional bundle + Guifan, which will set you back USD 100.  This is more than worth it, especially considering you can upgrade across devices and versions free of cost.  I've been using Pleco now for 7 years so currently it's cost me ~$0.04 a day (actually a little bit more, because prices used to be higher - Guifan originally cost $60).

 

CTA is then AUD $10 on top of that (or free if you get a licence while they are still on offer).

 

If the price of Pleco still seems a bit steep, then Anki also provides a good way to revise words, and maybe it's something you can download first to see if the workflow works for you.  The problem with Anki is that it doesn't have a built-in dictionary, so flashcard creation has to be done manually, which is time-consuming and a pain.  CTA will take most of that pain away, because you can simply import in to Anki the words exported from CTA.  However, CTA currently only provides definitions from CEDICT, which is not as comprehensive as the paid dictionaries available in Pleco.

 

Pleco is also great because it's quick and easy to look up a word, and then (this is the key), trivial to add looked up words to a list for later revision. The problem with this approach is that it's still a pain to break from reading to look up words and so you end up not always looking up words that you should, or maybe you don't bother to add these words to a list for review because you 'already knew the word, but just wanted to check' (i.e. you didn't really know the word but you've pretended to yourself that you did).

 

CTA helps get around this by allowing you to prioritise and prelearn words in a given piece of text that you intend to read.  It also gives you a good indication as to whether or not a piece of text is appropriate for your current level.  For example, say you have 10 books to read, but you're not sure which one to read next.  The best one to read first is the one that has the lowest number of new words because as you read that, you'll pick up a whole bunch of new vocabulary which will then make the next book easier, and the next book easier, and so on.  Assuming you've been using CTA for a while and have a reasonably accurate list of known words, then with CTA, you can load up each book and see which one has the lowest number of unknown words and then read that one.
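As a rough illustration of the "read the book with the fewest unknown words first" idea, here is a small Python sketch. It assumes each book is already segmented into a list of words; the titles, word lists and function names are hypothetical, and CTA does this for you through its own interface.

```python
def unknown_word_count(words, known):
    """Number of distinct words in the text that are not in the known set."""
    return len(set(words) - known)

def easiest_first(books, known):
    """Return book titles ordered from fewest to most unknown words."""
    return sorted(books, key=lambda title: unknown_word_count(books[title], known))

# Hypothetical data: title -> pre-segmented text.
known = {"我", "喜欢", "看", "书"}
books = {
    "Book A": "我 喜欢 看 电影 和 音乐".split(),
    "Book B": "我 喜欢 看 书 和 电影".split(),
}

reading_order = easiest_first(books, known)
print(reading_order)
```

After finishing a book, adding its words to the known set and re-running the comparison shows how the remaining books have become easier.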

 

When I changed CTA to 10 point, the characters looked fine

It seems really strange that decreasing the font size improved the clarity!  Still, if it works, then great.

 

Then I exported just that category as a word list. And I successfully imported my known words into the text analyzer!

Excellent!  Glad to hear it works.


Thank you for that comprehensive reply.

I'll probably download it soon.

 

 

Quote

When I changed CTA to 10 point, the characters looked fine

It seems really strange that decreasing the font size improved the clarity!  Still, if it works, then great.

 

I've actually seen that before with one of the fonts (in word, I think...maybe NJ Star).  It comes from the pixel-mapping (or vector line point definitions?) being designed for one size...if it gets too big, the shapes distort, so it can actually look better in a smaller font.

 

I'm probably not explaining it correctly, but the point is, I have actually seen that happen before with Chinese characters and font sizes.


hmm some of my formatting disappeared when I pasted the above post. I originally had the exported results in tables to demonstrate the export issue I found. I'm not sure how to get my tables back, sorry. Imron, I'll email a copy to you directly so you can better see what I mean. 


Thanks for the comprehensive review!

 

It takes literally two button clicks to install on windows

I put in a lot of effort to make it only take two clicks! Nice to see this appreciated.

 

"I think the only down side to doing it this way is that you have to do a manual search to add new cards to your Known category and manually export the known card list into Text Analyzer each time if you want to keep your known words updated."

 

Now that you have Chinese Text Analyser set up, going forward, the alternative is to export words from Chinese Text Analyser, and then import them in to Pleco.  When you export, there is an option to automatically mark exported words as known (with the expectation that if you don't know them yet, you will after importing them to another program like Pleco) and this way the lists should stay relatively in sync.  Once every few weeks or months you could then do an export from Pleco just to catch any you had added outside of CTA.

 

 

"I imported my list of known words from Pleco, and it imported, but I would have liked to see a success message or something to let me know it worked ok. Rather, it just took me back to a blank screen, and I wasn't sure if anything had happened."

 

Added to my list of things to do.

 

 

"However, I did notice that if you have more than about 7 or so tabs open, you will be unable to maneuver to the tabs on the right, since there is no way to access tabs that don't fit on the screen."

 

Improving this is on my list of things to do.  In the meantime, you can use Ctrl-Tab and Ctrl-Shift-Tab to cycle between tabs (even those off-screen).

 

"I think it would be really helpful if you curated the available fonts to only those that display Chinese text."

 

Windows has a mechanism for doing this, however it can be a little too strict, and sometimes results in no fonts shown.  Fixing this up is on my list of things to do, but for the pre-1.0.0 releases I figured better to have too many fonts than too few.  Remembering the last font used across sessions is also on my list of things to do.

 

 

"but it might be nice if you could include a more brush script-y type font."

 

Unfortunately, due to font licensing costs, including a nicer font is probably not really doable at this price point - at least not until there's a much higher volume of users.

 

 

"I'm not really sure how important the File Statistics are, but I guess it doesn't hurt to have them there."

 

Not really important, but possibly interesting for some people.

 

 

"I recommend you make a clearer division between the "Total" section and the "Unique" section"

 

Already on my list of things to do.

 

 

"One additional statistic that I think would be good to have is Number of Unknown words in a document."

 

Added to my list of things to do.

 

 

"I'm honestly not sure what "Cumulative % Frequency" means, and I was not able to figure it out."

 

It's similar to column 4 of the Jun Da frequency lists.  It's just the sum of all frequencies for that word and all words more frequent than it.
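To make the definition concrete, here is a minimal Python sketch of a cumulative percentage column, assuming the text is already segmented into a list of words. The function name and sample data are made up for illustration and are not taken from CTA.

```python
from collections import Counter
from itertools import accumulate

def cumulative_percent(words):
    """Return (word, count, cumulative %) rows, most frequent word first."""
    counts = Counter(words).most_common()
    total = sum(count for _, count in counts)
    running = accumulate(count for _, count in counts)
    return [
        (word, count, 100.0 * cumulative / total)
        for (word, count), cumulative in zip(counts, running)
    ]

# Hypothetical text: 的 appears 5 times, 是 3 times, 山 twice.
sample = ["的"] * 5 + ["是"] * 3 + ["山"] * 2
table = cumulative_percent(sample)
for row in table:
    print(row)
```

Read the last column as "knowing this word and every more frequent word covers this percentage of the text".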

 

 

"I'm not sure how helpful the "First Occurrence" column is either. I haven't determined a use for it."

 

First occurrence is actually really useful for prioritising words.  For example, you can export the top 100 unknown words by frequency, and then sort them by first occurrence.  That way you can learn the most frequent words in the order that they will appear in the text you are reading.
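The two-step ordering described above (pick the most frequent unknown words, then arrange them by where they first appear) might look like this in Python. It is only a sketch that assumes a pre-segmented word list; `study_order` and the sample text are hypothetical.

```python
from collections import Counter

def study_order(words, known, top_n):
    """Take the top_n most frequent unknown words, then sort by first occurrence."""
    unknown = [w for w in words if w not in known]
    counts = Counter(unknown)
    # Record the position at which each word first appears in the text.
    first_seen = {}
    for index, word in enumerate(words):
        first_seen.setdefault(word, index)
    top = [word for word, _ in counts.most_common(top_n)]
    return sorted(top, key=lambda w: first_seen[w])

# Hypothetical segmented text: the rarest word appears first.
text = ["天气", "今天", "很", "今天", "很", "很", "好", "好", "好", "好"]
order = study_order(text, known=set(), top_n=3)
print(order)
```

Here 天气 is dropped because it is the least frequent, and the remaining three come back in reading order rather than frequency order.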

 

"I think an additional feature I would like to see would be a set of left right arrows so you can go to the next occurrence of the word of interest fairly easily if you are in a long document, and see each place the word is used in context."

 

Double click again, and it will take you to the next instance and so on.  If you have used the Edit->Find dialog, then 'F3', 'Ctrl-G' or 'n' will take you to the next occurrence of the word without needing to have the dialog open (no shortcuts for previous words yet, that is on my list of things to do).

"there is no Bookmark feature in the program."

 

Good idea.  Added to my list of things to do.

 

 

"It may also be helpful to have an option to record notes in certain sections of the books or mark up places that you had difficulty reading and may want to go back and re-read after studying the vocabulary in that section."

 

Added as a low-priority to do item.

 

 

"Now I'll review the Export settings. I have tried the File --> Export --> To File settings. I have never tested the To Email, because I think it works through Microsoft Outlook, and I do not use Outlook."

 

The program just opens your default email application as specified by the OS.  Actually this feature is not that useful because it only works with small amounts of data, and I may end up removing it altogether.

 

 

"I do not know what the Pre and Post mean and what that tab is meant to do."

 

A document has one or more paragraphs, a paragraph has one or more words.  Pre and Post allow you to add things before (Pre) or after (Post) each document, paragraph, and word during the export process.  So for example, you could set:

document pre: <html><head><title>CTA Export</title><meta charset="UTF-8"><style>.word:hover {color:red;}</style></head><body>

document post: </body></html>

paragraph pre: <p>

paragraph post: </p>

word pre: <span class="word">

word post: </span>

 

Then you'll get an html file split in to paragraphs and words, with the color changing red when you hover over a word.  Or you could just have them all empty, except have a single space for 'word post' and then you'd get a segmented file with words separated by spaces and so on.
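The Pre/Post idea can be mimicked in a few lines of Python. This is only a sketch of the concept, not CTA's actual export code: the function `export`, its parameters, and the exact order in which pieces are emitted are assumptions.

```python
def export(paragraphs, doc=("", ""), para=("", ""), word=("", "")):
    """Join words, wrapping each document, paragraph and word in (pre, post) text.

    `paragraphs` is a list of paragraphs, each a list of segmented words.
    """
    parts = [doc[0]]
    for paragraph in paragraphs:
        parts.append(para[0])
        for w in paragraph:
            parts.append(word[0] + w + word[1])
        parts.append(para[1])
    parts.append(doc[1])
    return "".join(parts)

# Hypothetical segmented document with two paragraphs.
paragraphs = [["我", "是", "学生"], ["你好"]]

html = export(
    paragraphs,
    doc=("<body>", "</body>"),
    para=("<p>", "</p>"),
    word=('<span class="word">', "</span>"),
)
spaced = export(paragraphs, word=("", " "))
print(html)
print(spaced)
```

The second call leaves everything empty except a single space after each word, which gives the space-separated segmented output mentioned above.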

 

 

"I think it would be nice if there was a keyboard shortcut for marking words as known"

 

Unfortunately I can't know where your eyes are looking, and I don't want the reading process to require the user to keep going next, next, next, next with arrow keys or something.  What I will probably do is make double click toggle the known/unknown status.

 

"I'm not sure if Imron had in mind designing the Chinese Text Analyzer as just a tool to aid in picking which books/texts to read , or as a stand alone reader, or both"

 

The hint is in the name: Chinese Text Analyser, not Chinese Text Reader.  I also have plans for a separate reader program, but Chinese Text Analyser can work as a basic reader.  Actually the plan has always been to develop the reader, but the reader required a segmenting engine and so I wrote that first, and thought it was useful enough to release as a standalone program in the interim.

 

 

"I have to admit that I miss having a pop up dictionary feature."

 

Popup definitions will be in the next release (0.99.3), with looked up words automatically being marked as unknown.

 

 

"but as of now I'm finding it hard to give up the crutch"

 

It might be an indication that you need to look at easier texts.  One of the design goals of CTA was to make shortcomings in your ability obvious, rather than letting you gloss over them.  The logic being that by making such things obvious, you can know what you need to focus on to improve.

 

So if there is a point of pain, then it is a possible indication of something in your learning that you need to address.  CTA will not coddle you and is meant to give you an accurate view of your real ability.

 

 

"I personally think that the default should be set to Unknown instead of All"

 

Added to my list of things to do.  Note however that currently the program will save your last choices, so once you set the list as 'Unknown' it will remember this the next time you export.

 

 

"I'm not really sure what the difference between Word and Simplified"

 

Word is the actual word as it appears in the text.  If your text is all in Simplified, then Word and Simplified will be the same.  If however your text was all in Traditional, then Word and Simplified would be Traditional and Simplified respectively.

 

 

"I'm not sure what dictionary is used,"

 

Currently CEDICT, but with the possibility of supporting other dictionaries later.

 

 

"some of the fields exported with the previous sentence's period preceding it."

 

This is a bug, it should not be doing this and should stop on full-stops and/or newlines.  Can you please provide me a complete paragraph of text with the problem (or send me the source text via email), and I'll look in to it.

 

 

"You can see that the Cloze sentence got put on the second row."

 

Likewise a bug that I will look at if you send through example source text.

Thanks again for such detailed feedback. I'll try to address many of those issues in the next release.


Imron - I'm occasionally using it as a reader at work and finding it very helpful. One feature that I'd like to see (which perhaps I'm missing somehow) is double clicking to change the status of a word. Currently it seems only a right click to a menu can be used to change a word from unknown to known (or vice versa). Out of habit I find myself frequently double clicking - any chance of adding this in the future? [Edit: just noticed this is already planned, from your comments above - great!]

 

Also, I'll write a review myself in the near future.


Wow, great job! Works so quickly! So much attention is given to different use cases (browsing through the selected words is one of them). I'm so happy that developers prioritize performance, it is so crucial. It works great on Wine.

Yes, I agree with someone's comment about the good tools coming so late :)

 

Question:  How are the word combinations analyzed? Is it based on CC-CEDICT? If so, is it on your roadmap to allow users to choose how to analyze the combinations? Just a thought for now. I'll probably use it for classical Chinese, so I assume an additional combination list will be needed.

 

My two cents will be, together with other comments on this thread:

1) I understand your argument about "bad practices" in looking up every word. However, adding pinyin and a translation (or a link to ZDIC or another online dictionary) as an optional column could do the job and make it more powerful.

 

2) The ability to a) create a new document in the software, b) save it, or c) edit an existing document would be a great addition. Right now the clipboard is a bit limiting - I usually scrape texts from the internet and would like to throw them directly into CTA.

 

Kudos!
