Chinese-Forums

CTA for other languages


sannomiya


Hello, I was looking at some of the threads regarding the Chinese Text Analyzer and it looks like exactly what I needed, especially the ability to handle large texts. I know it is designed for Chinese, but I was wondering if it could also handle other foreign language texts, or if there was a slimmed-down version that could handle other languages. I normally use FLTR for non-Chinese use, but I very much like what was done with CTA. Also, I think I saw something regarding a companion software for text reading; will that be released soon?

 

Thank you.


It can be used for other languages, but it depends on the language - and different languages will have different degrees of success with it.

 

If the language uses spaces to separate words, then it will more or less work as is (though you need to enable a setting in the config file so that it doesn't ignore non-Chinese words).

 

If the language doesn't use spaces to separate words, then you will need to generate (or provide) a list of all (or very many) words from the language.  The segmenter will then process each sentence by matching the longest word it can from that list before moving on to the next word (and so on until the end of the sentence).  You can place this word list in c:\users\<username>\AppData\Local\ChineseTextAnalyser\data\words.u8 on Windows, or ~/Library/Application Support/ChineseTextAnalyser/data/words.u8 on OS X, and CTA will use it for processing.  It won't do a perfect job of segmenting, but it should be acceptable.
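As a rough illustration of the longest-match approach described above, here is a minimal Python sketch. It is not CTA's actual code: the `segment` function and the tiny word set are made up for the example, and a real word list would be loaded from a file rather than written inline.

```python
def segment(sentence, words):
    """Greedy longest-match segmentation against a known word list."""
    max_len = max((len(w) for w in words), default=1)
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking until a listed word matches.
        # A single character is always accepted so the loop keeps moving forward.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in words:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical word list, for illustration only.
words = {"中国", "中国人", "人民", "喜欢", "学习"}
result = segment("中国人民喜欢学习", words)
print(result)  # ['中国人', '民', '喜欢', '学习']
```

Note how the greedy match takes 中国人 and leaves 民 stranded, even though 中国 + 人民 would be the better split. That is the kind of imperfect but usually acceptable result to expect from this approach.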

 

If the language uses a complex script, such as right-to-left text or similar, then CTA will not work very well at all (though I hope to fix that eventually).

 

Finally, in all of the above situations, if you want dictionary definitions, then you'll need to create a dictionary file in CC-CEDICT format and place it in CTA's data directory (see here for more info).
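For reference, a CC-CEDICT entry is a single line of the form `Traditional Simplified [reading] /gloss 1/gloss 2/`. The sketch below formats entries that way in Python. The `cedict_line` helper is hypothetical, and repeating the Korean headword in both columns with a romanisation in the brackets is only an assumption about what would work for a non-Chinese language, not something confirmed for CTA.

```python
def cedict_line(traditional, simplified, reading, glosses):
    """Format one entry as: Traditional Simplified [reading] /gloss/gloss/"""
    return f"{traditional} {simplified} [{reading}] /{'/'.join(glosses)}/"


# Illustrative entries only: headword repeated in both columns (assumption).
entries = [
    ("계속", "계속", "gyesok", ["continuation", "continuously"]),
    ("학교", "학교", "hakgyo", ["school"]),
]
lines = [cedict_line(*entry) for entry in entries]
print("\n".join(lines))
```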

 

The companion reader is going to be a speed reader, though unfortunately it will be a while before that is released unless there is a sudden influx of people purchasing CTA, allowing me to justify spending more development time on it.


Hi Imron,

 

Many thanks. Fortunately, the language I have in mind is Korean, which does use spaces.  One more question if I may; is it possible to edit the segmentation or string in the program? 

 

To use English as an example, could I edit or tag 'continues', 'continued', 'continuing', etc. so that the program will recognize them all as conjugations of 'to continue'? Or to edit/tag a string of words so that the program recognizes a single phrase rather than the individual words making up that phrase?

 

Thank you.


To use English as an example, could I edit or tag 'continues', 'continued', 'continuing', etc. so that the program will recognize them all as conjugations of 'to continue'?

This is not currently possible.

 

Or to edit/tag a string of words so that the program recognizes a single phrase rather than the individual words making up that phrase?

To do this, select the entire text you want to recognise as a single word, right-click and choose 'Add custom word'.


Many thanks. Fortunately, the language I have in mind is Korean, which does use spaces

Also, let me know if you run into any issues trying to get Korean working (replying in this thread is fine), as I can probably make minor changes to work around small, currently unforeseen problems.


So I was just exploring the features of CTA and was very excited about how well it handled Chinese texts (thoughts on use for Korean below).  I only pasted a few paragraphs into the clipboard, but the results were instantaneous.  That right there convinced me that I should get a license. I also like the simplicity of it. I like how you can only mark a word as known or unknown. Other programs such as LWT or FLTR have options for degrees of understanding, but I don't find them very useful and somewhat distracting with the number of different colors on a document.

 

A couple of things I noticed, which may already have been addressed (I only read the posts relating to the OS X release, so I'm unsure whether these were intentional or on the to-do list):

 

1. it appears that CTA won't save cut/pasted text, either when you shut down or by the edit menu

 

2. CTA doesn't appear to recognize trackpad gestures (i.e. two-finger tap for right click options)

 

3. it doesn't seem to allow editing of anything cut/pasted

 

4. I wasn't able to figure out how to open a new clipboard without exiting the program first

 

 

For Korean, I did some testing with a bit of text cut/pasted from a news site:

 

1. CTA didn't immediately recognize Korean; I had to play around with the font options (interestingly, I couldn't use the arrow keys to just scroll through them quickly; I had to click on each to check if it would work)

 

2. despite having spaces in the text, CTA did not recognize anything.

 

 

Final thoughts:

 

This is a great product for Chinese and I'm sure that as development continues, it will only get better. I am certainly going to purchase my license. But what I kept thinking was that although it is geared for Chinese, many learners, regardless of language, would love this (assuming it will recognize the relevant font), especially with the difficulties of LWT and with FLTR no longer being updated. Similar to how Anki was originally meant for Japanese, but is now used for all sorts of uses (one of the reasons why I prefer Anki over Pleco for flashcards, despite Pleco having superior flashcard features for Chinese). If you ever do go that route, may I suggest the following:

 

1. in order to accommodate conjugations, perhaps allow for a particular word to be edited so that it exports to a word list in another form, e.g. if the text says 'continuing', then allow the user to edit it so that it exports the word 'to continue'. 

 

2. rather than being responsible for specific dictionary files for each individual language, perhaps allow a window somewhere that will link to an online dictionary of the user's choice.

 

In any case, thank you for your work on this.  Great product.

 

Cheers.


Thanks for the detailed feedback.

 

I only pasted a few paragraphs into the clipboard, but the results were instantaneous

Wait until you try a full novel - the results are almost as instantaneous.  I put a lot of work into this aspect, and CTA has been designed from the ground up with performance in mind.

 

I agree with you about degrees of understanding.  In the context of reading, when you get right down to it, you either know a word or you do not.  The degree to which you don't know it is not as important as the fact that your reading process was interrupted.  CTA is designed to highlight and make you aware of words that interrupted your reading process.

 

With regards to your points:

 

1. it appears that CTA won't save cut/pasted text, either when you shut down or by the edit menu

It does in the Windows version, but this was one of the features that didn't make the cut for the initial OS X release - I wanted to get the OS X version out there and it would have taken much longer if I'd waited until it reached full feature parity with the Windows version.  It's on my list of things to do.

 

2. CTA doesn't appear to recognize trackpad gestures (i.e. two-finger tap for right click options)

Can you provide more details on this?  It works fine for me on my MacBook Air.

 

3. it doesn't seem to allow editing of anything cut/pasted

This is by design and allows for performance improvement in the segmenting and analysing stages.  At some point I'd like to release a 'graded text' editor based on some of the CTA tech, but it would be a time sink to get the performance where I want it, so it's still far off in the future.

 

4. I wasn't able to figure out how to open a new clipboard without exiting the program first

I'm not quite sure what you mean by 'open a new clipboard'.  Can you provide more detail here on what you are trying to do?

 

I had to play around with the font options (interestingly, I couldn't use the arrow keys to just scroll through them quickly; I had to click on each to check if it would work)

When the font window is shown, it doesn't have the keyboard focus.  If you click on the font window title bar to give it focus (the traffic light buttons on the top left will become coloured instead of greyed out) then you should be able to use the arrow keys to cycle through fonts.

 

despite having spaces in the text, CTA did not recognize anything.

Although I mentioned editing the config file, I actually forgot to mention the config setting - if you open ~/Library/Application Support/ChineseTextAnalyser/data/config then in the [general] section there is an option called 'chineseOnly' which defaults to true.  You'll need to set it to false, otherwise CTA ignores any character not in the Unicode blocks assigned to CJK.
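Editing the file by hand in a text editor is the simplest route. For anyone who wants to script the change, here is a hedged Python sketch. It assumes the config file is INI-style with `key=value` lines under `[general]`, which is a guess based on the description above; the sample text and the `disable_chinese_only` helper are made up for illustration.

```python
import configparser
import io


def disable_chinese_only(config_text):
    """Return config_text with chineseOnly set to false in [general]."""
    parser = configparser.ConfigParser()
    # configparser lowercases keys by default; keep 'chineseOnly' as written.
    parser.optionxform = str
    parser.read_string(config_text)
    parser["general"]["chineseOnly"] = "false"
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Assumed file contents, for illustration only.
sample = "[general]\nchineseOnly=true\n"
updated = disable_chinese_only(sample)
print(updated)
```

Note that `configparser` drops any comments when it rewrites a file, so keep a backup of the original before overwriting it.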

 

Unfortunately, a quick check shows that it still won't work for Korean words, as it will only combine consecutive ASCII characters into words.  I will make some changes to this (probably in the next release) which will make it more supportive of other options.
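To illustrate the limitation just described (this is not CTA's code, only a Python sketch of the same behaviour), compare a word pattern restricted to ASCII letters and digits with one that accepts any Unicode word character. The sample text is made up.

```python
import re

# Only runs of ASCII letters/digits count as words (the behaviour described above).
ASCII_WORD = re.compile(r"[A-Za-z0-9]+")
# In Python 3, \w on str patterns matches Unicode word characters, including Hangul.
ANY_WORD = re.compile(r"\w+")

text = "CTA reads 한국어 문장"
ascii_words = ASCII_WORD.findall(text)
all_words = ANY_WORD.findall(text)
print(ascii_words)  # ['CTA', 'reads']
print(all_words)    # ['CTA', 'reads', '한국어', '문장']
```

With the ASCII-only pattern the Korean words are skipped entirely even though they are separated by spaces, which matches what was reported above.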

 

1. in order to accommodate conjugations, perhaps allow for a particular word to be edited so that it exports to a word list in another form, e.g. if the text says 'continuing', then allow the user to edit it so that it exports the word 'to continue'.

I'll look into it.  There are word stemming lists available for various languages, so I'll look to see how they could be incorporated.
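One way such a stemming list could be applied is as a simple lookup from inflected form to base form before counting or exporting words. The Python sketch below shows the idea only; the `LEMMAS` table and `lemma_counts` function are hypothetical and do not reflect any existing CTA feature.

```python
from collections import Counter

# Hypothetical stemming list: inflected form -> base form.
LEMMAS = {
    "continues": "continue",
    "continued": "continue",
    "continuing": "continue",
}


def lemma_counts(tokens, lemmas):
    """Count words after mapping each token to its base form, if one is listed."""
    return Counter(lemmas.get(t.lower(), t.lower()) for t in tokens)


tokens = "She continued reading and is continuing today".split()
counts = lemma_counts(tokens, LEMMAS)
print(counts["continue"])  # 2
```

Words missing from the list simply pass through unchanged, so a partial list still helps.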

 

2. rather than being responsible for specific dictionary files for each individual language, perhaps allow a window somewhere that will link to an online dictionary of the user's choice.

This is already a planned feature.  It still doesn't help when exporting wordlists though because if the user is exporting a large list, I don't want CTA to be hitting a server with several hundred requests a second :-)


Sounds great.  Glad to hear that most of these were already on your mind. Below are a few follow-ups to your comments (PS: how do you get rid of an accidentally inserted quote box in a topic reply?).

 

Cheers.

 

Quote

2. CTA doesn't appear to recognize trackpad gestures (i.e. two-finger tap for right click options)

Can you provide more details on this?  It works fine for me on my MacBook Air.

 

 

I was trying to do a two-finger tap on the trackpad to bring up a right click menu on the main window, but nothing happened. I was trying to paste my copied text. I don't think I pasted enough for a two-finger scroll. I'll try that when I get back.

 

 

 

3. it doesn't seem to allow editing of anything cut/pasted

This is by design and allows for performance improvement in the segmenting and analysing stages.  At some point I'd like to release a 'graded text' editor based on some of the CTA tech, but it would be a time sink to get the performance where I want it, so it's still far off in the future.

 

 

Ah, I see. What brought that to my attention was that I cut/pasted an article from a news site and it had a photo and photo caption.  I was trying to delete those from the main text.  Certainly not a biggie since I can edit a txt file and import that way.

 

 

4. I wasn't able to figure out how to open a new clipboard without exiting the program first

I'm not quite sure what you mean by 'open a new clipboard'.  Can you provide more detail here on what you are trying to do?

 

 

Maybe the use of "clipboard" is not accurate. I just saw a tab open up that was titled clipboard. I was trying to open up another window or tab to paste more text.  No specific purpose in mind, just playing with the program. I couldn't delete what I'd pasted in the original window and I couldn't open a new one, so I just exited the program to get back to a clean slate. The tab I saw earlier popped up when I pressed an option to "view list" somewhere.

 

Quote

I had to play around with the font options (interestingly, I couldn't use the arrow keys to just scroll through them quickly; I had to click on each to check if it would work)

When the font window is shown, it doesn't have the keyboard focus.  If you click on the font window title bar to give it focus (the traffic light buttons on the top left will become coloured instead of greyed out) then you should be able to use the arrow keys to cycle through fonts.

 

 

Hmm.  I thought I did.  I'll go back and try again.


I was trying to do a two-finger tap on the trackpad to bring up a right click menu on the main window, but nothing happened.

The two-finger tap will only bring up a right-click context menu if a document is currently open.  There is (currently) no context menu if there are no documents open.  This might be the cause of your issue.  Note that even then, there is no 'Paste' option in the context menu because you can't paste into the current document - paste will always paste into a new document.  Personally, I just use ⌘V to paste on OS X; otherwise you'll need to use the Edit->Paste menu option.

 

I couldn't delete what I'd pasted in the original window

As mentioned previously, all documents are read-only.  No editing/altering is currently possible.

 

i was trying to open up another window or tab to paste more text.

Just ⌘V or Edit->Paste.  As mentioned above, it will always paste the contents of the clipboard into a new document.
