Introducing Chinese Text Analyser


imron

  • 2 months later...

imron, what is the best indicator of the difficulty of a text in CTA, if you've never uploaded a list of your Known Words?

 

Is it the number of unique words/unique characters in the text? The HSK percentages?


Not the HSK percentages.  They are there mostly to show how the HSK is not that useful :mrgreen:

 

The number of unique words is one potential indicator of difficulty, but I'd also look at the number of words it takes to get to 98% comprehension of the text and see how big a proportion of the total unique words that is. I'd also look at what percentage comprehension you'd get if you learnt every word that appears more than once.

 

That gives you an idea of how many words you'd need to know/learn in order to read the book comfortably.
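
As a rough illustration of those two measures, here is a minimal Python sketch. It assumes the text has already been segmented into a list of words (`tokens`) and that no known-words list is loaded; the function names are made up for this example and this is not how CTA itself computes its statistics.

```python
from collections import Counter

def words_for_coverage(tokens, target=0.98):
    """Number of distinct words, taken most frequent first,
    needed to cover `target` of all word occurrences."""
    counts = Counter(tokens)
    total = len(tokens)
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return n
    return len(counts)

def coverage_of_repeated_words(tokens):
    """Fraction of all word occurrences that come from words
    appearing more than once in the text."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(tokens)

# Tiny made-up example: 6 occurrences, 3 distinct words.
tokens = ["我", "我", "我", "是", "是", "人"]
needed = words_for_coverage(tokens)          # distinct words to reach 98%
share = needed / len(set(tokens))            # proportion of the vocabulary
repeated = coverage_of_repeated_words(tokens)
```

Comparing `needed` against the total number of unique words gives the proportion mentioned above, and `repeated` is the comprehension you would reach by learning only the words that occur more than once.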


A thought. Dividing the number of unique words by the total number of words gets you the percentage of unique words in a text. Dividing the other way tells you, for example, that 1 in 5 words is unique. Not sure how closely the density of unique words correlates with difficulty though.
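
For what it's worth, the two divisions look like this in Python. This is only a sketch: `tokens` is assumed to be an already segmented word list, and `unique_word_density` is a name invented for the example.

```python
def unique_word_density(tokens):
    """Return (unique / total, total / unique) for a list of words."""
    unique = len(set(tokens))
    total = len(tokens)
    return unique / total, total / unique

# Made-up example: 10 word occurrences, 2 distinct words.
tokens = ["的"] * 9 + ["书"]
density, one_in = unique_word_density(tokens)
print(f"{density:.0%} unique, i.e. 1 in {one_in:.0f} words is unique")
```

Note that this ratio tends to drop as a text gets longer, since common words keep repeating, so it is probably only meaningful when comparing texts of similar length.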

 


7 hours ago, roddy said:

text-to-speech APIs? I want word lists for TV shows. 

Speech-to-text, you mean?

 

I've actually been toying with writing an application that does this, but for any language, not just Chinese (and by toying I mean I've already written a bunch of code and done test calls with the APIs and got reasonable results back).

 

Still not sure whether I have the time to make it though, or whether there is any demand for this kind of thing, especially as it would need to be a paid service (because Google/Microsoft charge for each API call).


8 hours ago, murrayjames said:

Not sure how closely the density of unique words correlates with difficulty though.

I think you'd also need to look at the frequency of those unique words in the text as a whole.  If many of those unique words only appeared once or twice in total, but when combined made up a significant percentage of total words, then that would affect difficulty, because it would mean lots of words that you need to put in work to learn but that don't really increase comprehension of the rest of the text.

 

Looking at the words it takes to get 98% comprehension (or some other reasonably high percentage) serves as a decent proxy for that.
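
To make the point about rarely occurring words concrete, here is a small Python sketch that reports how much of a text is made up of words seen at most twice. The function name, the `max_count` threshold and the sample data are all assumptions for illustration, and `tokens` is assumed to be pre-segmented.

```python
from collections import Counter

def rare_word_profile(tokens, max_count=2):
    """Summarise the words that occur at most `max_count` times."""
    counts = Counter(tokens)
    rare = {w: c for w, c in counts.items() if c <= max_count}
    return {
        "rare_words": len(rare),
        # share of the distinct vocabulary that is rare
        "share_of_vocabulary": len(rare) / len(counts),
        # share of all word occurrences contributed by rare words
        "share_of_tokens": sum(rare.values()) / len(tokens),
    }

tokens = ["的"] * 6 + ["山", "水", "山", "云"]
profile = rare_word_profile(tokens)
```

A high `share_of_tokens` together with a high `rare_words` count is the situation described above: many words to learn, each of which pays off only once or twice.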


Reinstalling CTA after a hard drive crash. I made a backup of the ChineseTextAnalyser AppData folder before the crash. After reinstalling and running CTA, do I replace the new AppData folder with the old AppData folder to get my known words back?

 

UPDATE: I did and it worked perfectly. The license copied over too!


  • 2 weeks later...

I'm doing the 14 day trial right now. I think it's a useful piece of software and will probably buy it, but I wish the word segmentation was better since it's a core feature of the app. Shouldn't it be possible to segment a txt file with a superior but slower segmentation library, save the segmented version, and have CTA use that?
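
The pre-segmentation half of that idea could be sketched roughly as below. Everything here is an assumption for illustration: `segment` is a toy greedy forward maximum matching routine standing in for whichever slower, better library one would actually use, `presegment_file` and the space-delimited output format are invented, and nothing in this thread says CTA can read such a file.

```python
def segment(text, lexicon, max_len=4):
    """Toy greedy forward maximum matching segmenter.
    A stand-in for a real segmentation library."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, fall back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

def presegment_file(src, dst, lexicon):
    """Write a copy of `src` with words separated by spaces."""
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(" ".join(segment(line.rstrip("\n"), lexicon)) + "\n")

print(segment("我喜欢中文", {"喜欢", "中文"}))
```

The segmentation step only has to run once per file, so its speed matters much less than it would inside an interactive tool.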


5 hours ago, drungood said:

but I wish the word segmentation was better since it's a core feature of the app

Segmentation is something that I've always wanted to improve, and in fact I have worked on implementing a bunch of different segmenters, but the main issue is not having enough time to build something suitable in terms of speed, memory usage and correctness.

 

As with everything, there are tradeoffs.  Most of the problems can be solved, but there's a large amount of work involved for only a minor increase in correctness. So when I have time to work on CTA it usually goes towards other features, because the segmenter is ball-park level correct, and that is sufficient for what I see as the main features of the app:

 

1. Finding frequently occurring unknown words in a piece of text.

2. Comparing texts to see the relative difficulties.
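
Both activities can be sketched in a few lines of Python, which also shows why they tolerate a few segmentation mistakes: they depend on aggregate counts rather than on any single sentence. This is an illustrative sketch only, with invented function names and data; `tokens` is the segmenter's output and `known` is a set of known words.

```python
from collections import Counter

def frequent_unknown_words(tokens, known, top=10):
    """Most frequent words in the text that are not in `known`."""
    counts = Counter(t for t in tokens if t not in known)
    return counts.most_common(top)

def known_coverage(tokens, known):
    """Fraction of word occurrences that are already known;
    comparing this across texts gives a relative difficulty."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in known) / len(tokens)

known = {"我", "喜欢"}
text_a = ["我", "喜欢", "中文", "中文", "电脑", "电脑", "电脑"]
text_b = ["我", "喜欢", "我", "喜欢", "中文"]

print(frequent_unknown_words(text_a, known))
print(known_coverage(text_a, known) < known_coverage(text_b, known))
```

Here `text_a` comes out as the harder of the two texts for this reader, and its most frequent unknown words are the ones worth learning first.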

 

Based on tests I've done, and on my own experience, improving the segmenter doesn't significantly improve those two activities.

 

The current segmenter does mean that CTA is less useful if you want to use it for precise segmentation on a sentence by sentence level.

 

6 hours ago, drungood said:

Shouldn't it be possible to segment a txt file with a superior but slower segmentation library, save the segmented version, and have CTA use that?

Solving for the general case is not that bad; it's the edge cases where things fall down.  E.g. what happens if the file is several GBs?  Most tools lock up.

 

CTA on the other hand will open (and highlight text) instantly and let you scroll anywhere through the file (though statistics take a bit longer to generate).

 

A GB of text might seem a bit extreme, but that's only about 1000 books, which is not unreasonable if generating information for a corpus or similar.
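
As a sanity check on those numbers: 1 GB across roughly 1000 books is about 1 MB per book, and since most Chinese characters take 3 bytes in UTF-8, that is somewhere around 330,000 characters per book. The general approach to handling files of that size without locking up is to avoid loading them into memory in one go. The sketch below shows that idea for simple character counts; it is a generic illustration with an invented function name, not a description of how CTA is implemented.

```python
from collections import Counter

def count_characters(path, chunk_chars=1 << 20):
    """Count non-whitespace characters in a UTF-8 text file,
    reading a bounded number of characters at a time."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        while True:
            # Text-mode read() returns whole characters, so multi-byte
            # characters are never split across chunks.
            chunk = f.read(chunk_chars)
            if not chunk:
                break
            counts.update(ch for ch in chunk if not ch.isspace())
    return counts
```

Memory use then depends on the chunk size and the number of distinct characters rather than on the size of the file.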

