Chinese-Forums

Introducing Chinese Text Analyser


imron

Cantonese:

I made a dictionary and it seems to work perfectly with CTA. You can easily find one already in CEDICT format, but I chose to start with one of the shared Anki decks that I trust the most. I exported the cards and worked them into the correct format with Excel. I think a lot of people deserve credit for the data, but I'm not sure how to apportion it.

 

The last trick was that the file must be named "cedict_ts.u8" and placed in Program Files/ChineseTextAnalyser/data/.
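For anyone doing the same conversion without a spreadsheet, here is a minimal sketch of writing entries in the CEDICT line layout (`Traditional Simplified [reading] /gloss/gloss/`). The function name `to_cedict_line` and the tuple layout of the exported cards are assumptions for illustration; a real Anki export will need its own column mapping.

```python
def to_cedict_line(traditional, simplified, reading, glosses):
    """Format one entry as a CEDICT-style line.

    Slashes inside a gloss are replaced, because "/" is the gloss separator.
    """
    cleaned = [g.strip().replace("/", ";") for g in glosses if g.strip()]
    return f"{traditional} {simplified} [{reading}] /{'/'.join(cleaned)}/"


# Hypothetical rows: (traditional, simplified, reading, list of glosses).
rows = [
    ("你好", "你好", "nei5 hou2", ["hello", "hi"]),
]

lines = [to_cedict_line(*row) for row in rows]
```

The resulting lines would then be saved as UTF-8 text under the file name mentioned above.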


Minor feature request. I've got a text with annotations in it, and I don't see any way of ignoring them for the vocab list or removing them. Right now I'm using a simple sed script to strip them before the program processes the text.

 

 

 

一个星期五的上午,天气很好。高明一边吃早饭一边看星期五的报纸[3]

 

Something like that where I have a large number of annotations like [4] or [15] in the text that aren't actually useful outside of the text.

 

I wouldn't consider this to be high priority, but it would be quite useful to have some way of dealing with it when the source text includes things that aren't actually text.
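A minimal sketch of that kind of pre-processing step, written in Python rather than sed. It assumes the annotations are always ASCII square brackets around digits, as in the sample above; the name `strip_annotations` is made up for illustration.

```python
import re

# Matches footnote-style markers such as [3] or [15].
ANNOTATION = re.compile(r"\[\d+\]")


def strip_annotations(text: str) -> str:
    """Return `text` with every [number] marker removed."""
    return ANNOTATION.sub("", text)


sample = "高明一边吃早饭一边看星期五的报纸[3]"
cleaned = strip_annotations(sample)
```

Full-width brackets or non-numeric notes would need a different pattern.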

 

And interestingly enough, for some reason most of the icons are now visible on my Linux install rather than none of them. Again, that's not terribly important, but it's kind of interesting that the situation is different now than it was when I first installed. I wonder what changes were made that caused that.


Cantonese:

 

I have some great news if you're interested in this. Short answer: Just download yedict_20130108.u8 from here, rename it to cedict_ts.u8 and put it in C:\Program Files\ChineseTextAnalyser\data. It's ready to go as is! 哈哈!(But it's in Yale with tone numbers instead of Jyutping.)

 

I might post the long answer in another thread about available Cantonese dictionaries.


Somewhat ironically, the top Google search result for CEDICT Cantonese Jyutping leads here.  Pity the site seems dead.  I've messaged the author so maybe he still has a copy of the file somewhere.

 

Also please note that files in C:\Program Files\ChineseTextAnalyser\data (including cedict_ts.u8) get overwritten every time you upgrade, so make sure to keep a copy of yedict lying around.

 

Future support for multiple dictionaries will address this problem.
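Until then, one way to make re-installing after an upgrade painless is to copy the file into place instead of renaming it, so the original download stays around as the backup. This is only a sketch: `install_dictionary` is a hypothetical helper, and writing into Program Files will normally need administrator rights.

```python
import shutil
from pathlib import Path


def install_dictionary(source, data_dir, target_name="cedict_ts.u8"):
    """Copy `source` into `data_dir` under the name CTA looks for.

    The source file is left untouched so it can be re-installed later.
    """
    target = Path(data_dir) / target_name
    shutil.copyfile(source, target)
    return target


# Example call (paths taken from the posts above; adjust to your machine):
# install_dictionary(r"C:\Downloads\yedict_20130108.u8",
#                    r"C:\Program Files\ChineseTextAnalyser\data")
```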


@imron, I sent you about a page's worth of text; hopefully that's enough to see the pattern I'm getting at. And I'd assume that Wine is just handling the icons better; I wouldn't expect you to be wasting time on that when you've already decided to do a port. The lack of images was never that big of a deal, even if it did lead to a slight amount of inefficiency when we'd have to wait for the pop up to appear.


Thanks.  I think I've got a better idea of what you want now and will see if there's something I can do.

 

even if it did lead to a slight amount of inefficiency when we'd have to wait for the pop up to appear.

All the popups should list a keyboard shortcut - remember that and you'll get an increase in efficiency :-)


@imron, I meant the pop up text over the buttons. Without the images being there consistently, one has to either memorize the position of the buttons or use it on a supported platform, which right now is Windows.

 

But, I do see your point, as far as I've noticed they all do list shortcuts. I just tend to use the program sporadically when I'm using a new text so things like that tend not to make much difference one way or another. I'll just do the entire text and then not use the program for a few weeks while I learn the new text.


  • 2 weeks later...

Are you sure it is Chinese Text Analyser that creates it and not the other program? CTA locks files it has open, so if another application wanted to open one it would need to copy it. Also, CTA only creates temporary files in its own directory or in the user's temp directory.


  • 4 weeks later...

I've been following this thread since it started, and I bought a license a while ago, but I've only recently started using it...

First of all, thank you. This program is amazing - it really is. I'm still at the point where looking at a page of Chinese text is really intimidating, and even if I know it "should" be easy because it's a textbook that is aimed towards an audience at a lower level than me, I still feel intimidated. But I think one of the greatest advantages of this program - if not the greatest - is that it can give you the confidence you need, by telling you pretty accurately how much you know and don't know of a text, to start reading stuff out of textbooks. I thought I was months (if not half a year or more) away from reading my first adult book... Then I decided just for fun to feed 許三觀賣血記 into CTA, as it was recommended as a good first adult book. I fed it in, started reading and clicking away words that I knew, and to my surprise, after reading half a chapter, I realized that it was totally unexpectedly within reach (unless it all of a sudden gets ridiculously harder or something...). The fact that - because of CTA - I've started reading my first adult book gives me tons of new motivation and confidence, and I don't think this point should be overlooked.

The second main advantage, as I see it, is being able to load anything into it and judge quickly if you are ready for the piece (based on vocab size). Closely related to the above, but still different, and still very useful.

Anyways, this wasn't meant to be a review - but thanks for making such a great program!

I do have a few suggestions/comments however. (Apologies if a few have been mentioned before - I've read the thread, but I might have forgotten that some of the points were already mentioned).

1. I know that you said you're working on improving the word segmentation - and you said that custom segmentation is coming. Or is it here? I'm wondering because I see you can already add your own definition of a word. The only problem with this is that if, for example, you have characters ABCD, and AB is a word, as well as BC and CD, and the program segments it into AB and CD when really it should be A as a single character followed by BC and then D as another single character, it won't let me add my own custom word, because BC is already a word; it's just not being segmented as one... Sorry, I know that might be a bit confusing - let me know if I need to rephrase.

2. After adding a new custom word, is it possible to add a definition for this word?

3. It would be nice to be able to both see a list of all custom words, and be able to delete them if one wanted. 

 

4. Under the words table, I can see total and unique, as well as the number and percent known and unknown for both of these. Under the characters table, I can only see total characters as well as unique characters. Is this on purpose? I know that you're a fan of studying words not characters, as am I, but I still think that giving the reader an idea of how many characters they know and how many they will have to learn would help both in motivation to read something, and also in being able to correctly gauge whether or not a piece is suitable for their level.

 

5. I think you should consider adding an option in some sort of settings menu to not allow dictionary pop-ups. Before I found that I could right click and see a dictionary answer, I was just reading it naturally and guessing/skipping the things I didn't know (because on my first reading through I wasn't looking to learn anything - that will come when I export the cards into anki). I think there is value in doing this, and it might help if I could turn off the pop-up definitions in a settings menu somewhere.

6. I know that this is a pre-release version (or I assume it is, as it's 0.99.9), but at some point it would be nice to have a small manual to go along with it. Many things are quite self-explanatory, but there are a few things that can only be discovered by accident (for example, I just realized that I could highlight occurrences of a word in the document by clicking on the word in the words list).
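To make the ABCD case in point 1 concrete, here is a sketch of greedy forward longest-match segmentation, a common baseline for dictionary-based segmenters. This thread does not say how CTA actually segments, so this is only an illustration of how an AB + CD split can come about even though BC is in the dictionary; the function name and `max_len` parameter are made up.

```python
def segment_longest_match(text, dictionary, max_len=4):
    """Greedy forward segmentation: at each position take the longest
    dictionary word, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


dictionary = {"AB", "BC", "CD"}
result = segment_longest_match("ABCD", dictionary)  # ['AB', 'CD']
```

Because the scan commits to AB first, BC is never considered; getting A + BC + D needs extra information such as word frequencies or context, not just one more dictionary entry.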


 


This program is amazing - it really is.

I think so too, thanks :mrgreen:

 

Regarding your points:

 

1 - Word Segmentation.  Yep, word segmentation in Chinese is hard and the example you mentioned is but one case of many.  I have a bunch of ideas for improving things, however it's the sort of thing that takes a large amount of time for very small improvements in accuracy. E.g. it might take a week working on some feature that improves segmentation accuracy from 90% to 92%, however at the moment that week of work would be better spent improving some of the more baseline features.  For this reason, better word segmentation is a low priority compared to other features I want to get in before finally releasing a 1.0.0 version.  In the meantime, feel free to send me real-world examples of incorrect segmentation you come across as that will give me concrete examples to test with.

 

2 - Not yet, but there will be eventually.

 

3 - Custom words are stored in C:\Users\<username>\AppData\Local\ChineseTextAnalyser\data\words.u8.  It's UTF-8, one word per line.  Add and delete as necessary, but it's probably best to do so when CTA is not running.  There will eventually be a GUI for this.

 

4 - It's on purpose, partly because yes I think people should be focused on words but also because there's no accurate way to get the total number of the characters people know from the list of words that they know (because you might know a word, but not the individual characters if you saw them in isolation or in another word).  It's then cumbersome to have people having to mark both words and characters that they know, especially since for the most part, 'known characters' is a useless metric.  Contrary to what you said it won't help you correctly gauge whether or not a piece is suitable, at least not any better (and in most cases far worse) than the 'known words' metric.

 

5 - It should be simple to add something like this.  I definitely want to discourage people from using popup dictionaries because I think they hinder long-term acquisition skills (the current compromise CTA makes is that when you look up a word it forcibly marks it as unknown).

 

6 - Yeah, a manual is also on my list of things to do, however the UI is still in a state of flux and I don't want to put a manual together, and then have to redo a bunch of screenshots and explanations when things change.  There will be one for the 1.0.0 release.
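For the file mentioned in point 3, here is a minimal sketch of listing and deleting entries from a script instead of a text editor. It assumes only what is stated above (UTF-8, one word per line); the helper names are hypothetical, and it should be run while CTA is closed.

```python
from pathlib import Path


def load_words(path):
    """Read a one-word-per-line UTF-8 file, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def save_words(path, words):
    """Write the words back, one per line, as UTF-8."""
    Path(path).write_text("".join(w + "\n" for w in words), encoding="utf-8")


def remove_word(path, word):
    """Delete every occurrence of `word`; return how many were removed."""
    words = load_words(path)
    kept = [w for w in words if w != word]
    save_words(path, kept)
    return len(words) - len(kept)
```

Keeping a copy of the file before editing it is a sensible precaution.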

 

Anyway, thanks for the feedback and be sure to let me know if you have any other questions or suggestions.


Imron --

 

Just a quick update: I continue to use CTA every day. It's become my default reader for the computer, largely replacing Wenlin. It's encouraging to have a sea of red text in long documents slowly turn to black.

 

Any ETA on bookmark functionality? Would sure help when reading files of book length. Currently, at the end of a reading session, I copy my last sentence in a separate file, then search for that sentence in CTA next time I read. It's a bit cumbersome.

 

I haven't forgotten about my promised review, BTW. Eagerly waiting for 1.0 :-)
