
Introducing Chinese Text Analyser


imron


Would it be possible to switch between multiple profiles (different wordlists)? Sometimes it would be easier to operate on a blacklist instead of a whitelist, or it would be helpful to import certain lists to quickly create different kinds of flashcards. However, I don't want to go through the trouble of changing the white-list each time. For example, I may want to study a specific set of words in the order that they appear in a certain document. Well, there are a lot of possible uses.


Multiple wordlists will definitely be happening. They're already supported in the underlying code; it's just a matter of putting the GUI together.

Also, you can already get blacklist/whitelist behaviour in the export dialog: just set the filter to 'exclude words on list' or 'include words on list'.
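To make the include/exclude distinction concrete, here is a minimal Python sketch of the two filter modes. The function name and the mode strings are made up for illustration; they are not CTA's own identifiers, and this is not how CTA implements the export dialog.

```python
def filter_words(words, wordlist, mode):
    """Keep or drop words depending on the filter mode.

    mode is 'include' (whitelist) or 'exclude' (blacklist). Both names are
    hypothetical and only mirror the two options in the export dialog.
    """
    listed = set(wordlist)
    if mode == "include":
        return [w for w in words if w in listed]
    if mode == "exclude":
        return [w for w in words if w not in listed]
    raise ValueError(f"unknown mode: {mode!r}")


document_words = ["我", "喜欢", "学习", "中文"]
my_list = ["喜欢", "中文"]
print(filter_words(document_words, my_list, "include"))  # ['喜欢', '中文']
print(filter_words(document_words, my_list, "exclude"))  # ['我', '学习']
```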


Just on the portable data issue - I'm currently using CTA on 3 different PCs - my work PC, my mobile PC and my home PC.  

 

Every now and then I export the known list and save it to a text file on a OneDrive folder that is automatically replicated.  

Then I update the other PCs with that file by loading it and marking them all as known.  

It's a bit of a pain.  And it doesn't handle forgotten (known -> unknown) words at all.  Fortunately I don't often mark words as unknown.

 

I'd much prefer if I can have some way to easily manage this situation. 

I know it's not an easy problem to solve, but I just wanted to let you know about this usage scenario and maybe sometime you can consider a solution.

Ideally each PC could access the same list, add any new words, remove old words, after some configuration.  

 

Anyway still love the tool.  Here's one of my other usage cases:

 

1. Load text from a podcast (PopupChinese/ChinesePod/Slow Chinese/QQSRX) into CTA

2. Load podcast MP3 and text into WorkAudioBook.

3. Go through the podcast and match the audio with the text (focusing on more difficult sentences; I don't bother to match up simple stuff).  Make sure I can understand/hear everything.  Export an SRT file.

4. Use Subs2SRS to make Anki cards for the text.  Import cards into Anki into my library of sentences (not active).  Get like 20-30 new sentences.

5. Go back to CTA and look for words I don't know well.  

6. For each word I don't know well, but want to memorize, make an Anki cloze using that sentence, move into my active reviews so they will show up as new cards

7. Mark that word as known.  Go back to step 5 until out of interesting words.

8. Move the MP3 into a listening playlist which I listen to periodically.

 

This is working pretty well, a few more steps than I would like, but still a good way to study texts and commit them to memory.

I've been using it with textbook passages (I have the audio) and I've improved my ability to recall the vocabulary quite a bit (can tell when my teacher tests me and I'm much more able to recall specific vocab we've studied before).


I'd much prefer if I can have some way to easily manage this situation.

Syncing between different computers is a relatively high priority on my todo list. The underlying data format for storing lists of known words already supports convenient merging between different lists (including knowing which words were added and removed); there's just no GUI yet that lets you do that.
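As a rough illustration of what merging with added/removed tracking could look like, here is a Python sketch. The log format (timestamp, word, action) and the "latest action wins" rule are assumptions made for this example; CTA's actual storage format is not described in this thread.

```python
def merge_logs(*logs):
    """Merge change logs from several machines into one set of known words.

    Each log is a list of (timestamp, word, action) tuples, where action is
    'add' or 'remove'. This layout is hypothetical. For every word, the
    action with the latest timestamp wins.
    """
    latest = {}
    for log in logs:
        for timestamp, word, action in log:
            if word not in latest or timestamp > latest[word][0]:
                latest[word] = (timestamp, action)
    return {word for word, (_, action) in latest.items() if action == "add"}


work_pc = [(1, "学习", "add"), (2, "电脑", "add")]
home_pc = [(3, "电脑", "remove"), (4, "播客", "add")]

# 电脑 was marked known at work, then forgotten (removed) later at home,
# so it is not part of the merged result.
print(merge_logs(work_pc, home_pc))
```

Unlike exporting and re-importing a plain list of known words, a log like this also carries the known -> unknown changes from one machine to the others.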


1. Load text from a podcast (PopupChinese/ChinesePod/Slow Chinese/QQSRX) into CTA

2. Load podcast MP3 and text into WorkAudioBook.

3. Go through the podcast and match the audio with the text (focusing on more difficult sentences; I don't bother to match up simple stuff).  Make sure I can understand/hear everything.  Export an SRT file.

4. Use Subs2SRS to make Anki cards for the text.  Import cards into Anki into my library of sentences (not active).  Get like 20-30 new sentences.

5. Go back to CTA and look for words I don't know well.  

6. For each word I don't know well, but want to memorize, make an Anki cloze using that sentence, move into my active reviews so they will show up as new cards

7. Mark that word as known.  Go back to step 5 until out of interesting words.

8. Move the MP3 into a listening playlist which I listen to periodically

Sounds like an awful lot of work for one podcast, but you have some excellent resources in there that I wasn't aware of. Thanks for the information!


Sounds like an awful lot of work for one podcast, but you have some excellent resources in there that I wasn't aware of. Thanks for the information!

It's not that bad.  About 30 minutes for a 2 minute podcast, maybe 10 new words in there.  I am listening to sentences and looking at the new words most of the time, so it's actually study.  There's only a little actual work for me to do as everything is cut/paste or import/export.

I've now been using Chinese Text Analyser (CTA) for a few days. As I had talked about Learning With Texts (LWT) previously  and both programmes have several overlapping features, I'd like to share my first impressions about how they compare.  I haven't read the entire thread about CTA, so sorry if I repeat things or if there are things I should know but don't.

 

Differences in features:

 

- CTA is designed for Chinese, while LWT should work with any language

- LWT has sound (listen while you read), not CTA. But what's the use of a built-in mp3 player? You can always start two programmes.

- LWT has a kind of "Library manager" that lets you tag, add, delete texts, etc.

- CTA is extremely easy to install (Windows only, for now, but it works nicely under Wine for Linux at home. I also use it as a portable version, from a USB key, but I have to copy the application data manually); LWT runs on a server, which is much tougher to set up.

- CTA has a Chinese parser; LWT has no parser, you parse your texts manually, one word at a time, which is much more tedious. However, dare I say that CTA's parser leaves something to be desired? It produces many, many false positives.

- LWT has a very flexible built-in SRS mechanism for reviewing words (in sentences or isolated, L1 or L2, etc.); with CTA, you have to export cards to another programme, like Pleco or Anki (you can do that with LWT too).

- CTA is incredibly fast, even with huge texts, while LWT is very slow (it runs on a server, with an SQL backend…)

- In LWT, you get to choose your dictionaries (and you can define several), but you have to configure them on your own. CTA uses …? What in fact? ccdict? Nothing to set up and no latency, as in LWT.

- LWT can export your text with annotations (translations, pinyin).

 

Difference in philosophy

 

In CTA, if you use the popup dictionary to reveal the meaning of a word, the word is automatically tagged as "unknown". Fair enough, though harsh. In LWT, you assess your own knowledge of the word on a scale from 1 to 5 (plus "Well known"). Each word in your text is colour-coded according to its status, which can hurt the eyes when there are many unknown words with varying statuses.

 

On the other hand, I've noticed that, while reading, I kept on checking and checking the same words because I thought I might have forgotten their pronunciation (much more often than their meaning) - more often than not, only to discover that my first impression was correct. I hope that a positive effect of CTA will be that - when I see a word that I'm *supposed* to know - I will learn to trust my brains and consider that what I hear in my head *is* correct, and that there's no need to check again.

 

Note that I've used LWT for other languages (Latin and Finnish), but I stopped using it some time ago because I've chosen to focus on Mandarin exclusively, and LWT was too much of a hassle to maintain on a server for just one language.

 

My first impression is that CTA may be more appropriate for intermediate/advanced learners, because (1) though it provides segmentation, you have to be very careful about the output. When you start learning Chinese, segmenting words is often a problem (where do Chinese words start and end?). In LWT, you're left in the dark. In CTA, everything is parsed but the output might be incorrect; (2) it's much easier and faster to manage longer texts with CTA; (3) I do hope that the main advantage for me, in the long run, will be increased confidence about what I'm supposed to know already. 

 

Though CTA is written for Windows, there's something Unix-ish about that programme, in that it is designed to do one thing, to do it fast and well, and to complement other programmes when needed (like Pleco or Anki for reviewing cards, ccdict (?) to provide a dictionary, etc.) rather than be a resource-hog Swiss army knife.


Thanks for the writeup and feedback.  Just to address some points:

 

LWT has sound (listen while you read), not CTA. But what's the use of a built-in mp3 player? You can always start two programmes.

No intention of adding sound just yet.  Like you said, try to do one thing well rather than everything.  If I ever get around to producing graded content subscriptions it will include audio and a playback mechanism, but that's a looong way off.

 

LWT has a kind of "Library manager" that lets you tag, add, delete texts, etc.

This is a much requested feature for CTA, and will be coming after I get the OS X version done and out.  It will also allow searching, exporting and statistics across all texts in a corpus.

 

CTA is extremely easy to install

A lot of effort went into making it so :D  I'm glad to see it appreciated.

 

However, dare I say that CTA's parser leaves something to be desired? It produces many, many false positives.

Guilty as charged.  I have plans to improve it, but as mentioned earlier in the thread, it takes a lot of effort to get relatively small percentage increases in accuracy, so I intend to finish other features first before working on improving the segmenter.
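To show why segmentation errors are hard to eliminate, here is a Python sketch of forward maximum matching, a textbook dictionary-based approach. This is not CTA's segmenter (the thread doesn't say how that works); the function name, the tiny lexicon and the `max_len` parameter are all made up for illustration.

```python
def greedy_segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest word
    found in the lexicon, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


lexicon = {"研究", "研究生", "生命", "起源"}
print(greedy_segment("研究生命起源", lexicon))
# Greedy result: ['研究生', '命', '起源']
# Intended reading: 研究 / 生命 / 起源 ("research the origin of life")
```

The greedy pass grabs 研究生 (graduate student) because it is the longest match, which is exactly the kind of false positive a dictionary-only approach produces. Fixing cases like this generally needs context, which is where the extra effort goes.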

 

with CTA, you have to export cards to another programme, like Pleco or Anki

Once again, the do one thing well philosophy.  There are plenty of good, existing SRS tools, so at the moment it's a more effective use of my development time to work on other features rather than writing yet another SRS program.

 

CTA is incredibly fast

Another thing I'm happy to see appreciated, as this also involved considerable effort  :D

 

CTA uses …? What in fact? ccdict?

Yep.  I've looked into licensing other dictionaries too, but the user base is still too small to justify the costs.

 

Fair enough, though harsh.

严师出高徒 :D

 

I hope that a positive effect of CTA will be that - when I see a word that I'm *supposed* to know - I will learn to trust my brains and consider that what I hear in my head *is* correct, and that there's no need to check again.

What I hope will happen is you'll export those words, and then spend some time drilling them until you're confident you won't need to look them up in the dictionary the next time. 

 

It might magically happen if you look up the word enough, but if you don't export them, you should at least spend 20-30 seconds going over the character in your mind and setting up a mental trigger so that you won't forget the pronunciation the next time.  The way I like to do this is imagine myself encountering the word at some point in the future and not being sure about it, and then inserting a fake memory of me remembering it correctly.  Then the next time I come across it, I already have a 'memory' of me remembering it correctly, so it's easier to recall and be confident about.

 

I do hope that the main advantage for me, in the long run, will be increased confidence about what I'm supposed to know already.

The other thing I hope it's useful for is in pre-learning vocabulary for new content.  e.g. say you want to read a novel.  You can export the most frequent X unknown words, sorted by first occurrence, and then you can pre-learn those words and you won't need to do as much looking up in the dictionary while reading.
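The "most frequent X unknown words, sorted by first occurrence" export can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the text has already been segmented into a list of words; the function and parameter names are hypothetical and not part of CTA.

```python
from collections import Counter


def prelearn_list(segmented_words, known, top_n):
    """Return the top_n most frequent unknown words, ordered by where
    they first appear in the text."""
    unknown = [w for w in segmented_words if w not in known]
    counts = Counter(unknown)

    # Remember the position at which each unknown word is first seen.
    first_seen = {}
    for index, word in enumerate(unknown):
        first_seen.setdefault(word, index)

    most_frequent = [word for word, _ in counts.most_common(top_n)]
    return sorted(most_frequent, key=first_seen.__getitem__)


words = ["c", "a", "a", "b", "b", "b", "d"]
print(prelearn_list(words, known={"d"}, top_n=2))  # ['a', 'b']
```

Sorting by first occurrence rather than by frequency means the study list follows the order in which the words will turn up while reading.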


  • 5 weeks later...

I'm still a frequent user and huge fan of this program, even though I've been dormant on these forums for some time. Two points:

  • I received a copy for free to review early on - although I'd like to support development as I really value this software and look forward to its gradual but steady improvement. Is your preference for me to simply buy a second license via the website, or perhaps use a PayPal (or other) donation? I'm in no rush and flexible, so want to provide the $10 in a way that provides the most support.
  • I'd prefer if we could define the double-click action - currently it marks/unmarks a word as known. My preference would be to have it pop up the definition and mark the word unknown. Then I would read a document, popping up definitions as needed, and then add any words not looked up to my "known" list. Make sense?

Again, huge fan and look forward to the personal corpus functionality down the line!


Glad to hear you're still finding the program useful.  Obviously I'm not going to complain if you buy a second licence, but don't feel the need.  If you really want to support the program, probably a better way is promote it among your friends and convince them to buy a licence :-)

 

I'll look at allowing users to customise double-click.  There will also be a keyboard option for showing dictionary definitions too.


Hi Imron,

 

Will CTA eventually support the ability to add custom definitions? Is it possible to edit the dictionary directly?

 

EDIT: I ask because I have added thousands of custom words. It would be great to define (or add pinyin to) the ones I forget more frequently.


Is it possible to edit the dictionary directly?

CTA will eventually support adding custom definitions, but in the meantime, yes, you can edit the dictionary directly.  A note of caution however - the dictionary will be overwritten if a new release comes out and you upgrade versions.  So make sure to back up the dictionary file before upgrading.

 

Anyway, the file you need to edit is:

 

c:\program files\chinesetextanalyser\data\cedict_ts.u8

 

It's a text file, using the standard cedict format and utf-8 encoding.  Although you can open and edit it with notepad, you need to make sure you save it again as utf8 format.  A better choice is to use something like notepad++, which will automatically preserve character encoding and line formats.
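For reference, an entry in CEDICT format is one line holding the traditional form, the simplified form, the pinyin with tone numbers in square brackets, and the definitions separated by slashes. The Python sketch below formats and parses such lines; the helper names are made up for this example, and the regular expression only covers the basic entry layout.

```python
import re

# Traditional Simplified [pin1 yin1] /definition 1/definition 2/
CEDICT_LINE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")


def format_entry(traditional, simplified, pinyin, definitions):
    """Build one CEDICT-style line from its parts."""
    return f"{traditional} {simplified} [{pinyin}] /{'/'.join(definitions)}/"


def parse_entry(line):
    """Split a CEDICT-style line into its parts, or return None for
    comment lines and anything else that does not match."""
    match = CEDICT_LINE.match(line.strip())
    if match is None:
        return None
    traditional, simplified, pinyin, definitions = match.groups()
    return traditional, simplified, pinyin, definitions.split("/")


line = format_entry("學習", "学习", "xue2 xi2", ["to learn", "to study"])
print(line)  # 學習 学习 [xue2 xi2] /to learn/to study/
print(parse_entry(line))
```

Running custom entries through a check like `parse_entry` before pasting them into the dictionary is a cheap way to catch a missing bracket or trailing slash.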

 

If you want to be forward compatible, and guard against CTA overwriting your custom dictionary entries I would do the following:

 

Go to:

 

c:\users\<username>\AppData\Local\ChineseTextAnalyser\data

 

and create a file called cedict_ts.u8, and put all your new words in that file (as above, this should be utf8 encoded, and have entries in CEDICT format).

 

Then I would make a backup copy of the main

c:\program files\chinesetextanalyser\data\cedict_ts.u8

file, perhaps called

c:\program files\chinesetextanalyser\data\cedict_ts.u8.original

 

Then I would:

* make sure CTA is closed

* copy the file c:\program files\chinesetextanalyser\data\cedict_ts.u8.original to c:\program files\chinesetextanalyser\data\cedict_ts.u8 (restoring the main dictionary file to its original state)

* open up your new c:\users\<username>\AppData\Local\ChineseTextAnalyser\data\cedict_ts.u8 file

* copy the entire contents

* paste it at the end of c:\program files\chinesetextanalyser\data\cedict_ts.u8

* save the updated c:\program files\chinesetextanalyser\data\cedict_ts.u8

 

This way the main c:\program files\chinesetextanalyser\data\cedict_ts.u8 is going to be the combined contents of c:\program files\chinesetextanalyser\data\cedict_ts.u8.original and your new definitions from c:\users\<username>\AppData\Local\ChineseTextAnalyser\data\cedict_ts.u8, and you will have separate copies of both your new content and the original dictionary.
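The manual copy-and-paste steps above could also be scripted. The Python sketch below is one way to do it under some assumptions: the function name is hypothetical, the paths are passed in rather than hard-coded, and the files are handled as bytes so that existing line endings are left alone. It is not something shipped with CTA.

```python
from pathlib import Path

UTF8_BOM = b"\xef\xbb\xbf"


def rebuild_dictionary(main, original, custom):
    """Write `main` as the contents of `original` followed by `custom`.

    Run this only while CTA is closed.
    """
    base = Path(original).read_bytes()
    extra = Path(custom).read_bytes()

    # Some editors prepend a byte-order mark when saving as UTF-8;
    # drop it so it does not end up in the middle of the combined file.
    if extra.startswith(UTF8_BOM):
        extra = extra[len(UTF8_BOM):]

    # Fail early if the custom file is not valid UTF-8.
    extra.decode("utf-8")

    # Make sure the first custom entry starts on its own line.
    if base and not base.endswith(b"\n"):
        base += b"\n"

    Path(main).write_bytes(base + extra)


# Example call, using the paths from this post (replace <username>):
# rebuild_dictionary(
#     r"c:\program files\chinesetextanalyser\data\cedict_ts.u8",
#     r"c:\program files\chinesetextanalyser\data\cedict_ts.u8.original",
#     r"c:\users\<username>\AppData\Local\ChineseTextAnalyser\data\cedict_ts.u8",
# )
```

Writing under c:\program files usually needs administrator rights, so the script may have to be run from an elevated prompt.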

 

It might seem like a lot of work to do this, but the benefit will be that a future version of CTA will look to see if c:\users\<username>\AppData\Local\ChineseTextAnalyser\data\cedict_ts.u8 exists, and if so, combine it with the main dictionary file.

 

By having all your new definitions in that file, at some point it will all just start working automatically.


Hey, Imron, have been using this to parse 锵锵三人行 scripts daily and export character, pinyin, and cloze sentences. Love the program. 

 

Two quick questions. 

 

One, are there still plans for a Mac version?

 

Two, are there any plans to add functionality to colourise the characters in the viewing box by tone (red for 1st, yellow for 2nd, etc.)? Or would there be any way to do a hack-around to enable this?

 

Best,

Uni


One, are there still plans for a Mac version?

Yes.  Still underway, I've just been busy lately and paid work has taken priority.

 

There is currently no way to enable colours per tone either by configuration option or hack.

 

There are two reasons why this is unlikely to happen. First, at a technical level it means not just segmenting Chinese, but also doing some sort of semantic analysis to distinguish between characters that are written the same but have different pronunciations.  This would be partly solved by a more intelligent segmenter (which I hope to get around to writing eventually). The second reason is more pedagogical: one of the design goals of CTA is to get people reading Chinese content without relying on aids, and the only real way to do that is to not provide those aids.
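The technical obstacle can be seen with a handful of characters that have more than one reading. The Python sketch below uses a tiny, hand-written reading table (an assumption for this example, not CTA data) and shows that a per-character lookup cannot pick a tone for them without context.

```python
# Hypothetical reading table: the tone numbers each character can take.
# 5 stands for the neutral tone.
TONES = {
    "长": {2, 3},  # chang2 (long) / zhang3 (to grow)
    "好": {3, 4},  # hao3 (good) / hao4 (to be fond of)
    "了": {3, 5},  # liao3 / le5
    "书": {1},     # shu1
}


def tone_for_colouring(char):
    """Return the tone of `char` if it has exactly one reading in the
    table, otherwise None (unknown character, or context is needed)."""
    tones = TONES.get(char, set())
    if len(tones) == 1:
        return next(iter(tones))
    return None


for char in "书长好了":
    print(char, tone_for_colouring(char))
```

Only 书 gets a definite tone here; the other three need the surrounding word or sentence to decide, which is the part that depends on a better segmenter.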


  • 3 weeks later...

Suggestion: introduce options for spacing between lines - I find 1.5 or double spacing really helps. 

 

Much less important: allow for tables to paste in correctly - these are often included in the articles/policies I'm reading, so currently I need to tab between the original and analysed version whenever I reach one. As noted, much less important but would be nice.


Probably addressed earlier in this thread, but just in case:

A visual cue (e.g. an underline) that moves at a user-defined pace under the characters, and also scrolls automatically through a document. It doesn't need to properly segment as it does this, just move at a steady pace. When I want to read quickly in a dense text I often find myself tracking with my finger/a pen with improved results. Digitizing this would be great - not sure how you feel about this being a crutch or not.


Suggestion: introduce options for spacing between lines - I find 1.5 or double spacing really helps.

This should be trivial to add.  It'll be in the next version.

 

A visual cue (e.g. an underline) that moves at a user-defined pace under the characters,

Chinese Text Analyser was originally going to be 'Chinese Speed Reader', and the original purpose was to be a tool that lets you improve your reading speed based on the method I outlined here.  The speed reader needed all of the features of the analyser and then I decided to release the analyser part first as a standalone product.  The speed reader will come later probably as a separate product (but that will work and share vocab with CTA).


This is a follow up from another thread.

 

 

Yes.  CTA can import new vocabulary - just have a file with one word per line (anything on the line after the first whitespace is ignored, so the line can contain other things after the word also).

In Pleco you can export your cards as a utf8 text file.  Don't include any extra data with the cards and you'll have a file that contains the headword followed by a tab, followed by the pinyin.
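The "one word per line, ignore everything after the first whitespace" rule is easy to mimic, for example to check a Pleco export before importing it. The Python sketch below is an approximation of that rule, not CTA's import code; the function name is made up, and it also skips blank lines and leading whitespace, which the description above doesn't cover.

```python
def parse_word_lines(lines):
    """Take the first whitespace-delimited field of every non-blank line."""
    words = []
    for line in lines:
        fields = line.split(None, 1)  # split on any whitespace, once
        if fields:
            words.append(fields[0])
    return words


# Lines as they might appear in a headword<TAB>pinyin export.
sample = ["学习\txue2 xi2\n", "\n", "中文 Zhong1 wen2\n"]
print(parse_word_lines(sample))  # ['学习', '中文']

# Reading from a file works the same way, since a file object yields
# lines. 'utf-8-sig' also accepts files that start with a BOM:
# with open("pleco_export.txt", encoding="utf-8-sig") as handle:
#     words = parse_word_lines(handle)
```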

To clarify: I won't be able to export database metrics from Pleco, so after importing into CTA, I would have to manually mark words as known/unknown, etc.?
