Chinese-forums.com
Learn Chinese in China

imron

Introducing Chinese Text Analyser


Yadang
On 6/26/2019 at 12:17 AM, murrayjames said:

what is the best indicator of the difficulty of a text in CTA, if you've never uploaded a list of your Known Words?

 

On 6/26/2019 at 2:35 AM, imron said:

The number of unique words is one potential indicator of difficulty, but I'd also look at the number of words it takes to get to 98% comprehension of the text and see how big a proportion of total words that is, and I'd also look at what percentage comprehension you get if you learnt every word that appeared more than once.

 

This is what I do. In addition, I look at the bottom few words on that list (the list of words I'd have to learn to get to x% comprehension; I use 95%) and how often they appear in the text. If they don't appear at least 3 or 4 times in the text, I consider the text too hard. If I see a newly learned word only once in the text, I'm probably not going to remember it, so it's less worth learning. Basically because of what Imron says:

 

On 6/26/2019 at 5:57 PM, imron said:

If many of those unique words only appeared once or twice in total, but when combined made up a significant percentage of total words, then that would affect difficulty, because it would mean lots of words you need to put in work to learn, but that don't really lead to increased comprehension for the rest of the text.
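For anyone who wants to play with these indicators outside CTA, here is a minimal sketch of the checks described above. It assumes you already have the text as a list of segmented words and your known words as a set; the function names and the `min_count`/`tail` defaults are my own inventions for illustration, not anything CTA exposes.

```python
from collections import Counter


def words_to_learn(words, known, target=0.98):
    """Unknown words, most frequent first, needed to reach `target` coverage."""
    total = len(words)
    if total == 0:
        return []
    counts = Counter(words)
    covered = sum(c for w, c in counts.items() if w in known)
    to_learn = []
    for word, count in counts.most_common():
        if covered / total >= target:
            break  # target comprehension reached
        if word in known:
            continue  # already counted in `covered`
        to_learn.append((word, count))
        covered += count
    return to_learn


def coverage_if_repeats_learned(words, known):
    """Coverage if every word appearing more than once were learned."""
    if not words:
        return 0.0
    counts = Counter(words)
    covered = sum(c for w, c in counts.items() if w in known or c > 1)
    return covered / len(words)


def tail_is_rare(to_learn, min_count=3, tail=3):
    """True if any of the last `tail` words to learn appears fewer than `min_count` times."""
    return any(count < min_count for _, count in to_learn[-tail:])
```

If `len(words_to_learn(...))` is a large share of the unique words, or `tail_is_rare(...)` comes back true, that matches the "lots of work for little extra comprehension" situation from the quote.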

 

 

 

22 hours ago, drungood said:

Shouldn't it be possible to segment a txt file with a superior but slower segmentation library, save the segmented version, and have CTA use that?

It is possible, and this is what I do. The reason is, as Imron said, CTA's native segmentation is perfectly good for comparing texts and finding my next text to read, but less good for segmentation on a sentence by sentence level, which is what I need to create cloze deletion flashcards of unknown words with their corresponding sentences.

 

I use the Stanford Word Segmenter described in this post. It separates words with spaces, so its output is perfectly compatible with CTA. After I export the cloze sentences, I use Excel to remove the spaces.
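If you'd rather not do the space removal in a spreadsheet, a small script can do the same job. This is only a sketch: the character ranges below are my assumption of what counts as "Chinese text" (Han characters plus common CJK and full-width punctuation) and are not exhaustive, and `unsegment` is a made-up name.

```python
import re

# Han characters plus CJK/full-width punctuation (assumed, non-exhaustive ranges).
_CJK = r"\u3000-\u303f\u4e00-\u9fff\uff00-\uffef"

# Spaces or tabs that directly follow or directly precede a CJK character.
_SPACE_NEXT_TO_CJK = re.compile(rf"(?<=[{_CJK}])[ \t]+|[ \t]+(?=[{_CJK}])")


def unsegment(line: str) -> str:
    """Remove the spaces a segmenter inserted, leaving spaces between Latin words alone."""
    return _SPACE_NEXT_TO_CJK.sub("", line)


if __name__ == "__main__":
    print(unsegment("我 喜欢 看 书 。"))
```

Spaces between two Latin-script words are kept, so English fragments inside a sentence stay readable.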

 


imron

I've just released a new version of CTA.

 

This is more of a maintenance release than a big new-feature release.  The two main things it adds are:

 

1. Fixing a crash bug in macOS if the document contained characters that didn't exist in the current font, and

2. High DPI support for Windows (in both single and multi-monitor setups).

 

It also adds a bunch of minor bug fixes, plus minor new features such as drag-and-drop for opening files, and keyboard navigation - both with arrow keys and vi `hjkl` keys.  With keyboard navigation, you can also press `d` to show the definition of a word.

 

The full release notes are here.

dougwar

Hi 

Is there a way to export a word with its Chinese definition?
I tried to find a way, but without success.

 

Thanks.

imron

Not easily, but it's something on my list of things to do.

 

By not easily, I mean currently you'd need to have a CEDICT-formatted file containing Chinese definitions instead of English ones.  If you are able to create such a file, then making CTA use it is quite easy.
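For reference, a CC-CEDICT entry is one line of the form `Traditional Simplified [pin1 yin1] /definition 1/definition 2/`. Below is a rough sketch of writing such a file from your own Chinese-Chinese data. The function names, the output filename and the sample definition are all placeholders I made up; where the definitions come from, and how the finished file gets hooked into CTA, are not covered here.

```python
from typing import Iterable, Tuple


def cedict_line(traditional: str, simplified: str, pinyin: str,
                definitions: Iterable[str]) -> str:
    """Format one entry in CC-CEDICT line format."""
    glosses = list(definitions)
    if not glosses:
        raise ValueError("at least one definition is required")
    for gloss in glosses:
        if "/" in gloss:
            # '/' is the gloss separator, so it cannot appear inside a definition.
            raise ValueError(f"definition contains '/': {gloss!r}")
    return f"{traditional} {simplified} [{pinyin}] /{'/'.join(glosses)}/"


def write_cedict(path: str,
                 entries: Iterable[Tuple[str, str, str, Iterable[str]]]) -> None:
    """Write entries (traditional, simplified, pinyin, definitions) as UTF-8 text."""
    with open(path, "w", encoding="utf-8") as handle:
        for traditional, simplified, pinyin, definitions in entries:
            handle.write(cedict_line(traditional, simplified, pinyin, definitions) + "\n")


if __name__ == "__main__":
    # Placeholder entry with a made-up Chinese definition.
    write_cedict("chinese_definitions.u8",
                 [("朋友", "朋友", "peng2 you5", ["彼此有交情的人"])])
```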

murrayjames

Hello @imron,

 

Running 0.99.17. Today I noticed a segmentation problem I hadn't seen before. When the characters in a word are split across two lines, CTA treats these characters as separate words.

 

An example: The word 二流子 is split across two lines, with 二 as the last character of one line and 流子 as the first characters of the next line. When I mouse over the character 二, the whole word 二流子 is highlighted, as expected. If I right-click 二 and select Show Definition, CTA should show me the definition for 二流子. Instead, CTA shows me the definition for 二, then marks the character 二 as unknown. (See attached picture.)

 

This seems to happen only when a word is split across two lines. When a word fits within a line, CTA segments correctly.

 

[Attached image]

imron

Thanks.  I've had a couple of other people mention this also, and it's on my list of things to fix.  It only appears to affect dictionary definitions when the word is split across multiple lines.

Jan Finster

@Imron: I like CTA a lot and use it to analyse new texts.

 

I know that CTA tells me what % of words in my text are HSK 3,4,5,6, etc.

 

I wonder if there is a way for CTA to tell me what % of the 1300 HSK 5 words (or the 2500 HSK 1-5 words, or the 5000 HSK 1-6 words) are covered in the text I copy/pasted into CTA? In my eyes this could be useful for selecting suitable texts to study for HSK levels. If I knew this, I could create a "reading list" that would cover all HSK5 vocabulary...

 

 

 

dougwar

Make a known-words profile containing only the HSK 5 words and see the % known.
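There are two different percentages in play here, and a quick sketch may help keep them apart: the share of the text that is made up of list words, and the share of the list that turns up in the text at all. This assumes you have the text as a list of segmented words and the HSK list as a set; the function names are made up, and this runs outside CTA rather than showing how CTA reports anything.

```python
def share_of_text_in_list(words, word_list):
    """Fraction of the running words in the text that are on the list."""
    if not words:
        return 0.0
    return sum(1 for w in words if w in word_list) / len(words)


def share_of_list_in_text(words, word_list):
    """Fraction of list entries that occur at least once in the text."""
    if not word_list:
        return 0.0
    return len(set(words) & set(word_list)) / len(word_list)


def list_words_missing_from_text(words, word_list):
    """List entries the text never uses, e.g. to look for a complementary text."""
    return set(word_list) - set(words)
```

For a "reading list" that covers a whole level, the second and third functions are the ones to watch: keep adding texts until `list_words_missing_from_text` over the combined texts is empty or small enough.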

PerpetualChange

Having studied formally for a few years, I don't know if there's anything (course work or otherwise) that can get you to the level of 98% on a normal novel. The best I've gotten is 90% on some 三毛, and the majority of unknown words are not covered by the HSK or otherwise.

imron
12 hours ago, Jan Finster said:

If I knew this, I could create a "reading list" that would cover all HSK5 vocabulary...

@dougwar's suggestion is what you are looking for. However, in my experience you won't find many real-world texts that are suitable.  The HSK goes for breadth rather than depth: HSK 5 will only get you 50-70% comprehension on general native texts, and HSK 6 falls into the same range (it only gives you a few extra percentage points of comprehension vs HSK 5).

 

This is partly the problem that CTA was designed to solve - it helps you figure out the most relevant vocab to learn in a given piece of text, which is a far better use of time than learning words from HSK lists (see here for some figures on how that plays out).

 

6 hours ago, PerpetualChange said:

I don't know if there's anything (course work or otherwise) that can get you to the level of 98% on a normal novel.

Regular reading of novels is the only thing that will do it.  Train what you want to learn.

realmayo

Could you please remind me - or point me to the right post - about adding words to the dictionary?

Is my memory correct that: CTA knows that two or more characters form a 'word' by looking for them in the "words" file? So, years ago before 特朗普  was a thing, if I wanted it to be recognised as a name, I could simply add it to the "words" file? And then, optionally, if I want to add a definition, I need to modify the cedict_ts file?

imron
1 hour ago, realmayo said:

Could you please remind me - or point me to the right post - about adding words to the dictionary?

Here was my initial post (before I implemented the feature) about how it would work.   Then here is a post confirming how it worked once that feature was finished.

 

Let me know if you run into any problems, or if anything's not clear, and I'll provide more detail on what you need to do.

realmayo

Thanks Imron. One other thing: if, for example, I wanted to teach CTA what 朋友们 meant, am I right that I'd need to add 朋友们 to the "words" file too? Otherwise it wouldn't be segmented as a single word but would instead be recognised as 朋友 + 们.
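To illustrate why a word list matters for segmentation, here is a toy dictionary-based segmenter using forward maximum matching (always take the longest vocabulary entry that fits). To be clear, this is a generic textbook technique and an assumption on my part, not CTA's actual algorithm; `segment` and the tiny vocabulary are invented for the example.

```python
def segment(text, vocabulary, max_len=None):
    """Greedy longest-match segmentation against a set of known words."""
    if max_len is None:
        max_len = max(map(len, vocabulary), default=1)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                result.append(candidate)
                i += size
                break
    return result


if __name__ == "__main__":
    vocab = {"我", "的", "朋友"}
    print(segment("我的朋友们", vocab))            # 朋友 and 们 come out separately
    print(segment("我的朋友们", vocab | {"朋友们"}))  # 朋友们 stays together
```

With this kind of matcher, a string can only come out as one unit if it is in the vocabulary, which is the behaviour the question above is guessing at.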

 

murrayjames

Hey imron, a Chinese co-worker saw Chinese Text Analyser and asked me where she could buy English Text Analyser. This has happened to me more than once. Just FYI

 

$$$

imron
22 hours ago, murrayjames said:

Just FYI

Yep.  A general text analyser is one of about a dozen ideas I have that I'd like to make at some point.  It's just a matter of finding the time (which I don't really have at the moment).

 

22 hours ago, murrayjames said:

$$$

Also $$$ to develop.

