Chinese-Forums

Introducing Chinese Text Analyser


imron


On 3/28/2022 at 3:57 AM, imron said:

The code is definitely there for it and it should all be hooked up.  Not sure why it's not working. 

Weirdest thing: PgUp/PgDn does work in the word statistics view and the word list view on Linux, whereas, as far as I remember, those are exactly the 2 views where it does not work on Windows. If I remember correctly, I could only use PgUp/PgDn in the text view on Windows.


  • 2 weeks later...

As a new member, I thought I should go back through the threads I’ve found particularly helpful and mark posts helpful, thanks, good question or like. But, I just can’t do it in this thread. It’s too darn long, and there are too many helpful posts. It took me hours a day for two days in a row to read through this thread.

 

I just want to say that this software is helpful in so many ways from beginner to expert. It seems to me that it should be selling like hotcakes and imron should be rich. But, what do I know? I was never good at marketing in my geek job.


On 4/7/2022 at 4:28 PM, imron said:

This is what I think as well :mrgreen: unfortunately it is not the case. 

Has anyone ever offered to buy you out or partner with you? Maybe their marketing branch could get your sales way up, even though they take a hefty cut, like 33% or 50%. I was in a situation like this once when I was in my 20s. But, I thought that I and my partner-friends could do it on our own. I’ve often wondered what it would have been like, if I had agreed to partner with people who had far better marketing capabilities. I ended up chalking it up to my karma being to not make money in a fast way. I'm totally okay with the amount of money I have to live on. Fortunately, geeks can earn decent money to live on in a slow way.


On 4/8/2022 at 8:05 AM, MTH123 said:

geeks can earn decent money to live on in a slow way.

For sure. I earn a decent amount, just not from CTA (or any of my other bits of software).   If it brought in more I’d be able to justify spending more time on it. 
 

I’ve had a couple of discussions with people about partnerships, but nothing ever came of them. 

On 4/8/2022 at 9:51 AM, Flickserve said:

Imron is still rich anyway

In non-monetary terms, yes. 


  • New Members

Thank you for this tool! I'm not ready to tackle my first book, but I'm going to start loading the important words for it into my anki deck now. Like this I can count down until I'm ready to go for it.

 

One question: When I add a custom word, does it remember it across files? The book I'm going to read is in a series, so it would be nice if it remembered the names.


  • 2 weeks later...

I’ve been toying with the idea of putting a text document through the Stanford Word Segmenter (https://nlp.stanford.edu/software/segmenter.html) and then putting the result through Chinese Text Analyser to maximize Chinese Text Analyser’s capability of generating a list of words with frequencies. Does anyone have any thoughts on this?


It will likely lead to better segmentation from CTA, because CTA starts and stops its segmentation algorithm each time it hits spaces or punctuation.  If every word is separated by a space, then CTA will have a higher likelihood of using the correct segmentation, but for longer words, or things like names, it might still incorrectly segment within the word.
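
To illustrate why spaces help, here is a minimal Python sketch. It is not CTA's actual algorithm, which is not shown in this thread: it assumes a simple greedy longest-match segmenter, a made-up toy dictionary (`TOY_DICT`), and hypothetical function names. The only behaviour taken from the description above is that segmentation restarts at every space or punctuation mark.

```python
import re

# Hypothetical toy dictionary, for illustration only (not CTA's dictionary).
TOY_DICT = {"我", "喜欢", "学习", "中文", "北京", "大学", "北京大学"}
MAX_WORD_LEN = max(len(w) for w in TOY_DICT)

# Whitespace and a few common CJK/ASCII punctuation marks act as hard boundaries.
BOUNDARY = re.compile(r"[\s，。！？、；：,.!?;:]+")


def segment_chunk(chunk):
    """Greedy longest-match inside one chunk (an assumed stand-in algorithm)."""
    words = []
    i = 0
    while i < len(chunk):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(MAX_WORD_LEN, len(chunk) - i), 0, -1):
            candidate = chunk[i:i + length]
            if length == 1 or candidate in TOY_DICT:
                words.append(candidate)
                i += length
                break
    return words


def segment(text):
    """Split at boundaries, then segment each chunk independently."""
    words = []
    for chunk in BOUNDARY.split(text):
        if chunk:
            words.extend(segment_chunk(chunk))
    return words


print(segment("我喜欢学习中文。"))   # unsegmented input
print(segment("北京 大学"))          # the space forces a split inside 北京大学
print(segment("王小明 喜欢 中文"))   # a name missing from the dictionary is still split up
```

In this sketch a pre-segmented text can never produce a word that spans a space, but a space-delimited token that is not in the dictionary (such as a name) is still broken into smaller pieces, which matches the caveat above.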

 

I've actually got a development build of CTA that segments entirely based on spacing, so a text passed through the SWS first would give an exact match on the segmented results.
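
As a rough sketch of what purely space-based segmentation amounts to (an assumption about the general idea, not the development build's code), the Python below splits pre-segmented text on whitespace, drops tokens that consist only of punctuation, and counts the rest. The function names and the sample sentence are hypothetical.

```python
import unicodedata
from collections import Counter


def is_punctuation(token):
    """True if every character in the token is in a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def space_segment(text):
    """Treat every whitespace-delimited token as one word, skipping punctuation."""
    return [tok for tok in text.split() if not is_punctuation(tok)]


def frequency_list(words):
    """Return (word, count) pairs, most frequent first."""
    return Counter(words).most_common()


# Made-up sample in the space-separated style a segmenter would output.
presegmented = "我 喜欢 中文 ， 我 学习 中文 。"
for word, count in frequency_list(space_segment(presegmented)):
    print(word, count)
```

Because the word boundaries come entirely from the input, the resulting list reflects whatever the upstream segmenter decided, including names it kept whole.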

 

That being said, one of the reasons I've never gotten around to improving the CTA segmentation is that for the most part, the frequency lists are accurate enough to be useful. 

 

Things like names are likely to be better detected going with the SWS approach. However, for most other words in a text, the most frequent words will still be at the top and the least frequent words will still be at the bottom; it's just that the ordering may be slightly different.

 

When generating a frequency list from a text, the difference is not going to be significant enough to worry about, as you'll still get more high-frequency words than you'll be able to deal with.  I'd be interested in seeing a comparison though if you do it.  Contact me via email and I can set you up with the development version of CTA with space segmentation.
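
One possible way to run such a comparison is sketched below in Python. Everything here is an assumption for illustration: the function names are hypothetical and the two frequency tables are invented numbers, not real output from CTA or the SWS. The sketch reports which words both lists share in their top N and how far each common word moves in rank.

```python
from collections import Counter


def ranks(freq):
    """Map each word to its 1-based rank, most frequent first."""
    return {word: rank for rank, (word, _) in enumerate(freq.most_common(), start=1)}


def compare(freq_a, freq_b, top_n=3):
    """Return the words shared by both top-N lists and the rank shift of every common word."""
    top_a = {word for word, _ in freq_a.most_common(top_n)}
    top_b = {word for word, _ in freq_b.most_common(top_n)}
    ranks_a, ranks_b = ranks(freq_a), ranks(freq_b)
    common = ranks_a.keys() & ranks_b.keys()
    shifts = {word: ranks_b[word] - ranks_a[word] for word in common}
    return top_a & top_b, shifts


# Invented counts standing in for two segmentations of the same text.
list_a = Counter({"的": 50, "我": 30, "是": 20, "中文": 5})
list_b = Counter({"的": 50, "是": 31, "我": 30, "中文": 4, "王小明": 3})

shared, shifts = compare(list_a, list_b)
print("shared top words:", shared)
print("rank shifts:", shifts)
```

With these made-up numbers the top of both lists contains the same words in a slightly different order, and the only word unique to one list is a name.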

 


On 4/20/2022 at 8:10 PM, imron said:

Contact me via email and I can set you up with the development version of CTA with space segmentation.

 

Thank you so much for your post! But, whoa, most of what you said is too advanced for me. Please give me time to digest it.


On 4/20/2022 at 8:10 PM, imron said:

Things like names are likely to be better detected going with the SWS approach

 

I've found that Google Translate is very good at translating names. So, it's one of the reasons why I use it when I'm translating Chinese subtitles.


  • 3 weeks later...
On 4/20/2022 at 8:10 PM, imron said:

I'd be interested in seeing a comparison though if you do it.  Contact me via email and I can set you up with the development version of CTA with space segmentation.

 

Thank you again for the development version of CTA with space segmentation! I have made a comparison involving SWS. I’ve placed the post in a new thread and provided a link here, as you suggested:

 

https://www.chinese-forums.com/forums/topic/62192-chinese-text-analyser-and-stanford-word-segmenter/

 


  • 9 months later...
  • 2 months later...

@imron is there a way to export a text file with all the words marked as known? Or could it be as easy as creating a new wordlist based on an existing one, so that the new wordlist file contains only the known words (and not the entire history of added and removed words)?

