Chinese-Forums

Introducing Chinese Text Analyser


imron


  • 5 weeks later...

Three feature requests for the next version:

 

1. I'd like to be able to drop a .txt file onto the dock icon or window and have it open. I'd also like CTA to tell the OS that it can read .txt files, so I can right click -> Open With -> CTA. Right now, the File->Open menu is the only way to load a text file, and it's clunky compared to drag and drop.

 

2. After adding a custom word and then clicking "Show Definition", the popup just says "no definition". Instead, can it show the pinyin of the word? I add names as custom words and then forget how to pronounce them. Before making a name a custom word, I could click on each character and sound it out, but once it's a custom word, I need to copy and paste it into Google Translate just to get the pinyin.

 

3. When using f.lux, the colors are really hard to tell apart. In the light scheme, the color for looked-up words looks almost identical to known words, and in dark mode the looked-up words are almost invisible.

 

Looking forward to the next version!


1 is easy and can be in the next version

 

2 I'll add to my todo list

 

3 can already be changed, but you have to edit a text file:

  windows: C:\Users\<username>\AppData\Local\ChineseTextAnalyser\colour-schemes\default.colours

  macos: /Users/<username>/Library/Application Support/ChineseTextAnalyser/colour-schemes/default.colours

  linux: /home/<username>/.local/share/ChineseTextAnalyser/colour-schemes/default.colours

 

The file is just a list of key=value pairs. The keys should be fairly self-explanatory, and the values are hex RGB colours without the # sign at the front.
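
As a rough illustration of editing such a file with a script instead of by hand, here is a minimal Python sketch. It assumes only what is stated above (one key=value pair per line, six-digit hex RGB values with no leading #). The key unknown.foreground and the value 0072BC are taken from a later post in this thread; known.foreground and the sample contents are made up for the example, and the real file may contain other keys.

```python
import re

HEX_RGB = re.compile(r"^[0-9A-Fa-f]{6}$")


def parse_scheme(text: str) -> dict[str, str]:
    """Parse key=value lines into a dict, skipping blank or malformed lines."""
    colours = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip():
            colours[key.strip()] = value.strip()
    return colours


def set_colour(text: str, key: str, hex_rgb: str) -> str:
    """Return the scheme text with `key` set to `hex_rgb`, leaving other lines alone."""
    hex_rgb = hex_rgb.lstrip("#")  # the file stores values without the '#'
    if not HEX_RGB.match(hex_rgb):
        raise ValueError(f"not a 6-digit hex RGB value: {hex_rgb!r}")
    out, found = [], False
    for line in text.splitlines():
        k, sep, _ = line.partition("=")
        if sep and k.strip() == key:
            out.append(f"{key}={hex_rgb}")
            found = True
        else:
            out.append(line)
    if not found:
        # Key was not present: add it at the end.
        out.append(f"{key}={hex_rgb}")
    return "\n".join(out) + "\n"


# Hypothetical file contents, for illustration only.
sample = "known.foreground=000000\nunknown.foreground=FF0000\n"
updated = set_colour(sample, "unknown.foreground", "#0072BC")
print(parse_scheme(updated))
```

To use it on the real file, read default.colours into a string, pass it through set_colour, and write the result back while CTA is closed. Keeping a backup copy of the original file first is a sensible precaution.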


On 7/25/2017 at 0:50 PM, Yadang said:

I just checked it again on Linux with a document of 5 unknown words to make sure I wasn't just getting lost in the sea of unknown words, and it didn't work. I'm pretty sure it doesn't on Windows either.

@Yadang I just checked this, and it works correctly on windows, linux and macos.

 

What version of Linux are you running?


On 8/24/2017 at 8:37 AM, imron said:

What version of Linux are you running?

 

lsb_release -a gives me:

 

Distributor ID:    Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:    14.04
Codename:    trusty

 

Is that what you're looking for? Note that I'm running it on a chromebook with crouton. A lot of things don't work the way they're supposed to. As for windows, I was using windows xp when I encountered the problem.


  • 2 weeks later...

The ChineseTextAnalyser data directory can be found here:

 

Windows: c:\users\<username>\AppData\Local\ChineseTextAnalyser\

macOS: /Users/<username>/Library/Application Support/ChineseTextAnalyser/

Linux: /home/<username>/.local/share/ChineseTextAnalyser/

 

And you can just copy the whole thing to the same location on the new computer.  If you are changing operating systems, some config options such as remembering size and positions of windows will not be preserved.

 

You can also just copy specific sub-directories within that directory e.g. wordlists or colour-schemes to get just those things.  Custom words you have specified can be found in data/words.u8
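
If you would rather script the copy, the following Python sketch shows one way to do it. The per-OS locations and the sub-directory names wordlists and colour-schemes come from this post; the function names and the destination path in the usage note are hypothetical. It only copies sub-directories that actually exist in the source.

```python
import platform
import shutil
from pathlib import Path


def cta_data_dir(system=None, home=None) -> Path:
    """Return the CTA data directory for the given (or current) OS and home directory."""
    system = system or platform.system()
    home = Path(home) if home else Path.home()
    if system == "Windows":
        return home / "AppData" / "Local" / "ChineseTextAnalyser"
    if system == "Darwin":  # macOS
        return home / "Library" / "Application Support" / "ChineseTextAnalyser"
    return home / ".local" / "share" / "ChineseTextAnalyser"


def copy_subdirs(src: Path, dst: Path, names=("wordlists", "colour-schemes")) -> list[str]:
    """Copy the named sub-directories from src to dst; return the ones that were copied."""
    copied = []
    for name in names:
        if (src / name).is_dir():
            # dirs_exist_ok lets the copy merge into an existing directory (Python 3.8+).
            shutil.copytree(src / name, dst / name, dirs_exist_ok=True)
            copied.append(name)
    return copied


if __name__ == "__main__":
    print(cta_data_dir())
```

For example, copy_subdirs(cta_data_dir(), Path("/media/usb/cta-backup")) would back up just the word lists and colour schemes to a (hypothetical) USB drive; run it with CTA closed.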


  • 3 months later...

Might have asked this before, but can't find it... Is there a way to import words so that they are only added as custom words, but not marked as known? Or even - if a list of words is added with the import feature, are they added as custom words if there's no dictionary entry that matches? If so, could I just import them by list, then copy and paste the document and re-mark them all as unknown?


There's no way to do this from the user interface, but you can do it by manually editing files.

 

1.  Close CTA if it is already open

2.  Go to the CTA data directory (macOS: ~/Library/Application Support/ChineseTextAnalyser/data/, windows: C:\Users\<username>\AppData\Local\ChineseTextAnalyser\data\, linux: ~/.local/share/ChineseTextAnalyser/data/)

3.  Open the file called words.u8 (or create it if it doesn't exist.  This should be a plain text file in utf-8 format)

4.  Paste custom words to the end of the file - one word per line

5.  Save the file and close

6.  Re-open CTA and enjoy all your custom words not yet marked as known.
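
Steps 3 to 5 can also be done with a short script. This is only a sketch under the assumptions stated in the steps (a plain UTF-8 text file, one word per line). Skipping words that are already in the file is an extra choice made here, not something CTA is known to require, and the function name is made up.

```python
from pathlib import Path


def append_custom_words(words_file: Path, new_words) -> list[str]:
    """Append words to words.u8 (UTF-8, one per line); return the words actually added."""
    text = words_file.read_text(encoding="utf-8") if words_file.exists() else ""
    seen = {line.strip() for line in text.splitlines() if line.strip()}

    added = []
    for word in new_words:
        word = word.strip()
        if word and word not in seen:  # skip blanks and words already present
            seen.add(word)
            added.append(word)

    if added:
        # Create the data directory and file if they do not exist yet.
        words_file.parent.mkdir(parents=True, exist_ok=True)
        with words_file.open("a", encoding="utf-8", newline="\n") as fh:
            # Start on a fresh line if the file does not end with a newline.
            if text and not text.endswith("\n"):
                fh.write("\n")
            fh.write("".join(w + "\n" for w in added))
    return added
```

Point words_file at data/words.u8 inside the CTA data directory for your OS, and make sure CTA is closed before running it, as in step 1.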


  • 3 weeks later...

Hi. I read somewhere that a new version was going to have a more accurate segmenter. Is this the case now?

 

Also, what if some words I already know (and which I import as a wordlist) are not in the CEDICT dictionary? Will CTA fail to recognize them since they're not in the dictionary?

 

Thanks!


You read that right here in this thread, in a previous post of mine.  I was actively working on it, but the results weren't as good as expected: the statistical information it relied on would overmatch words, and a large number of those overmatched words didn't exist in the dictionary, so a dictionary lookup on many words would just return a 'no definition' result.

 

There are a number of ways to solve that problem, but I got caught up with a bunch of other work and haven't gotten around to doing that yet.

 

Words that you have added as custom words will still be matched by the newer segmenter. CTA will look at the words you've added and give them a statistical bias.


I've just started with CTA - looks great so far. Due to my dodgy colour vision, the colours for known and unknown words in the text view are almost indistinguishable to me. In your post of 24 Aug 2017 you said the colours could be changed by editing this file:

  windows: C:\Users\<username>\AppData\Local\ChineseTextAnalyser\colour-schemes\default.colours

I can't find that file. The only folders in AppData/Local/ChineseTextAnalyser are clipboard, data, logs and wordlists. Is it still possible to change the colours in the light scheme? Some colours are much easier for me to distinguish than others.

 

Thanks


What version of CTA are you using?

 

That file should still be there, but it might not be created on disk until you run CTA for the first time and then quit the program. 

 

When you come up with a suitable set of colours can you let me know and I'll include them in the main program for other people who face similar issues. 


Thanks. As you said, the file appeared after I closed and reopened CTA.

 

After some experiments I found that #0072BC (RGB 0, 114, 188) worked well for unknown.foreground.

For me it is easily distinguishable from the colours for known words and hover/looked up words, but still dark enough to read easily.
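
For anyone trying their own colours, a small Python sketch like the one below converts between the #RRGGBB form and the RGB triple quoted above, and prints the line as it would appear in the colour scheme file (no leading #). The helper names are made up for illustration.

```python
def hex_to_rgb(value: str) -> tuple[int, int, int]:
    """Convert '#RRGGBB' or 'RRGGBB' to an (R, G, B) tuple of ints."""
    digits = value.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {value!r}")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def scheme_line(key: str, value: str) -> str:
    """Format a key=value line for the colour scheme file (value without '#')."""
    return f"{key}={value.lstrip('#').upper()}"


print(hex_to_rgb("#0072BC"))                         # (0, 114, 188)
print(scheme_line("unknown.foreground", "#0072BC"))  # unknown.foreground=0072BC
```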

 

So far the other colours seem fine. If I have any more issues I will let you know and suggest alternatives. BTW my kind of red-green colour blindness is one of the most common types, so if you are able to cover that in the main program as you said, I'm sure that would be very helpful for others.

 

I'm using version 0.99.16 - 64 bit, which I recently downloaded.

 

This program is definitely going to be very helpful. Thanks again.


  • 1 month later...
  • 6 months later...

@imron, is there any update on improving word segmentation in CTA? I recall you said something a couple of years ago about planning to improve it from its current, somewhat "hit and miss" state. I try using it about once a month and put it back with a sigh after seeing the "unknown words" it produces. An example from today:

 

没有最大限度地利用已有的资源。
is segmented as 没有/最/大限/度/地利/用/已/有的/资源。

 

Asking as a paying customer who resorted to running a Windows VM in order to use a better-working (and free btw) software. There are free(!) public(!) segmenters on Github IIRC ready for copy pasting or at least re-implementing in the language of your choice.


1 hour ago, uvwxyz said:

There are free(!) public(!) segmenters on Github IIRC ready for copy pasting

That are not as fast.

 

1 hour ago, uvwxyz said:

or at least re-implementing in the language of your choice.

Which I did, and got it to a working state with acceptable performance, but found that the results were also just as much hit and miss because the segmenter was overbroad - meaning it would rate things as words that were really phrases, and that then had an impact on looking things up in the dictionary because many of the hits were on non-dictionary words that returned no results.

 

There are ways to fix this, and I have investigated some of them, but then life and work got in the way and I haven't had time to get back into things.

 

2 hours ago, uvwxyz said:

Asking as a paying customer who resorted to running a Windows VM in order to use a better-working (and free btw) software

If you don't mind me asking, which software?

 

