
Introducing Chinese Text Analyser


imron


18 hours ago, imron said:
On 2/25/2020 at 6:36 AM, Pall said:

It would be fine to add also a function that can mark characters from HSK1, 2, 3, 4, 5, 6

I get that people are interested in things like this, but CTA aims to subtly push people away from thinking in terms of HSK.  In fact I only provide the HSK statistics to drive home the point that for most native content, the vocabulary for the HSK doesn't give you very much at all, and you're better off using frequently occurring words in what you are reading.

 

On 3/25/2020 at 10:24 AM, Pall said:

It would be especially great if CTA could also mark and count 'head' characters even though they're of a conditional nature.

I had a look through your link, but I'm still not entirely sure what you mean by head characters.

I understand your point. It's true, HSK5 is not enough even for reading newspapers. But the idea is to learn a base exceptionally well, so that you can feel and picture it in your mind; that makes things much easier when you meet a new character, since it can be fitted into the firm HSK5 framework. As for me, working from the first three poems, for B-P-M-F, D-T-N-L and Z-C-S, I've learnt all the characters from HSK5 very confidently. However, let's assume I sometimes doubt whether I know a new character whose pinyin I looked up, and it turns out to be one of the syllables I've learnt. I check it in the Table and... (1) It's there. Now I'm sure it's very unlikely that I'll hesitate to recognize it next time. (2) It's not there. OK, I just add it to the relevant card and cell in the Table, marking it in green. In this case it's also very likely that I'll memorize it much more quickly than I would without a firm basis in the form of the Table and the cards (two types, 'intermediate poem presentation cards' and 'head character cards').

Head characters are just characters selected to represent an entire syllable in a particular tone. For instance, in HSK5 there are three characters pronounced fáng: 房, 防, 妨. We take one of them as the representative; I selected 房. Head characters are also selected for 'fang' in the other tones: 方 for the 1st tone, 访 for the 3rd, and 放 for the 4th. They're all 'head' characters. We also select one of each character's meanings to use in formulas (this applies to all characters): 'corner' for 方, 'building' for 房 (though 'flat' might be the more common meaning), 'explore' for 访 and 'advertise' for 放. Head characters are used in 'head character cards': the head character is on one side, and the other characters with the same pinyin are on the reverse, arranged in a special order for better memorization; see the picture (at the bottom of it). In the 'head character cards' the other characters are linked to the 'head' ones by a 'horizontal' formula, a phrase connecting their meanings one after another (in the same word order), with other words added as necessary, of course. The meanings of the characters are the 'target' words.

 

The possible number of head characters is about 1200-1300 for the whole language, and within HSK5 it's 880 (just to begin with). But the number of head character cards required may be much smaller, since hundreds of syllables are represented by a single character, while others are represented by several characters with the same pinyin. For example, for HSK5 we need only about 350 head character cards.
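To make the card count concrete, here is a minimal Python sketch of the grouping described above. It only uses the fang characters named in the post; the function name `build_head_cards` and the data layout are assumptions made for illustration, not part of the poster's actual Table or of CTA.

```python
from collections import defaultdict

# (character, syllable, tone) entries taken from the fang example above.
ENTRIES = [
    ("方", "fang", 1),
    ("房", "fang", 2), ("防", "fang", 2), ("妨", "fang", 2),
    ("访", "fang", 3),
    ("放", "fang", 4),
]

# Head characters are a personal choice; these are the ones named in the post.
HEADS = {
    ("fang", 1): "方",
    ("fang", 2): "房",
    ("fang", 3): "访",
    ("fang", 4): "放",
}


def build_head_cards(entries, heads):
    """Return {head character: [other characters with the same pinyin]}.

    A syllable-and-tone group with only one character gets no card,
    which is why fewer cards than head characters are needed.
    """
    groups = defaultdict(list)
    for char, syllable, tone in entries:
        groups[(syllable, tone)].append(char)

    cards = {}
    for key, chars in groups.items():
        head = heads[key]
        others = [c for c in chars if c != head]
        if others:  # a card is only needed when the head has company
            cards[head] = others
    return cards


if __name__ == "__main__":
    print(build_head_cards(ENTRIES, HEADS))
```

With this sample data there are four head characters but only one head character card (for 房), which mirrors the point that many head characters stand alone.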

 

Then one of the 'head' characters is selected as the 'key' character to represent one of the 400 syllables regardless of tone. I chose 方. One meaning of each 'key' character is used in 'poems' composed according to the correspondence between initials (声母) and finals (韵母). All head characters, including key characters, are presented in 'intermediate poem presentation' cards; see the picture (at the top of it). Head characters of the same syllable (in different tones) are linked to one another by 'vertical' formulas. I managed to compose these formulas in English.

 

Thus, one has to learn only 120 'intermediate poem presentation cards' and some hundreds of 'head character cards' (for HSK5 only 350) to know all the characters at one's level. In my copy of the Dictionary of Contemporary Mandarin, 20,000 words are built from only 4,500 characters. So, starting from the HSK5 basis, one can move towards the objective of 4,500 by adding new characters marked in green.

IMG_0366.jpg


16 hours ago, roddy said:

Did you ever know that you're my hero
And everything I would like to be?
I can fly higher than an eagle
For you are Imron beneath my wings

 

Good poetry! I'm sure you could manage to compose in English even long 'horizontal' formulas linking a number of characters in a given order.

 


There isn't a way to export them, but there is a way to access them.

 

Assuming you are on Windows you can open file explorer and go to:

 

C:\Users\USERNAME\AppData\Local\ChineseTextAnalyser\wordlists\cache

 

Where USERNAME should be replaced with your computer username.

 

Each file in that directory corresponds to one wordlist, and will be a .txt file with one word per line.

 

NOTE: You should copy/open these files after closing CTA as recent changes might not have been written out to disk.  Also note, these are not the actual saved wordlists themselves, just a cached copy that CTA uses so it doesn't have to rebuild the full list from the actual saved format.  If you edit these files, those changes will be overwritten when CTA detects changes have been made and recreates the cached copy.
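For anyone who wants to read those cached files from a script, here is a minimal Python sketch. It relies only on what is stated above (a directory of .txt files, one word per line); the function name `load_cached_wordlists` is made up for this example, and UTF-8 encoding is an assumption, since the post does not say how the files are encoded.

```python
from pathlib import Path


def load_cached_wordlists(cache_dir):
    """Return {wordlist name: [words]} for every .txt file in cache_dir."""
    wordlists = {}
    for path in sorted(Path(cache_dir).glob("*.txt")):
        # Assumption: files are UTF-8. utf-8-sig also tolerates a BOM.
        text = path.read_text(encoding="utf-8-sig")
        # One word per line; blank lines are skipped.
        wordlists[path.stem] = [
            line.strip() for line in text.splitlines() if line.strip()
        ]
    return wordlists


if __name__ == "__main__":
    # Replace USERNAME with your Windows username, as described above.
    cache = Path(r"C:\Users\USERNAME\AppData\Local\ChineseTextAnalyser\wordlists\cache")
    if cache.is_dir():
        for name, words in load_cached_wordlists(cache).items():
            print(name, len(words))
```

The sketch only reads the files, in line with the note that edits to the cache are overwritten; close CTA first so the cache is up to date.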



Imron, question for you. Today I updated CTA to the latest version (0.99.18). After updating, I noticed that the known word % of the texts I was reading had fallen by 0.50–1.00%. Any idea why this happened?

On 11/2/2015 at 9:16 AM, Geiko said:

@Imron: according to CTA, my current known vocabulary is at 11912 words, but I've been analysing both simplified and traditional texts, and if I'm not wrong traditional characters and their simplified counterparts are counted as different words, so the real figure must be lower.

 

On 11/2/2015 at 9:39 AM, imron said:

That's correct, because not all of them will be easily guessable if you know the other.

From a few years back...

 

I'm wondering what the best way to handle this is, on the assumption that you know both character sets well and are happy to treat them as interchangeable - ie, if I mark 个 as known, I don't want 個 turning up in an unknown list, and vice versa. Off the top of my head...

1) Decide on one character set to use CTA with and convert to that before feeding any text in. FWIW, I find MS Word's Trad>Simp conversion very reliable. (Actually, looking at it more carefully now, I'm changing my mind on this. Plenty of mistakes you can easily skim over as they're similar enough, but not as good as I thought.) You could then back-convert exported unknown word lists if you wanted (although I'm less sure on MS Word's Simp>Trad conversion, and at that point it's working with a list of words and won't have so much context to go on. Not sure what difference that makes). Conversion issues aside, this seems most elegant as you don't have 'duplicate' entries.

2)  Every time you switch character sets, take your known word list, convert it, paste in, mark all as known. Again, possible conversion issues and I'm not keen on what it does to vocabulary size. 

3) Manually add as you go along. This seems least efficient.
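As a rough illustration of what options 1 and 2 involve, here is a Python sketch that checks an unknown-word list against a known-word list after folding traditional characters to simplified. The tiny `TRAD_TO_SIMP` table and both function names are placeholders invented for this example; a usable table would need thousands of entries, and none of this reflects how CTA or MS Word actually work.

```python
# Illustrative only: a real table would cover thousands of characters.
TRAD_TO_SIMP = {"個": "个", "學": "学", "習": "习", "們": "们"}


def to_simplified(word, table=TRAD_TO_SIMP):
    """Convert a word character by character; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in word)


def split_by_known_variant(unknown_words, known_words, table=TRAD_TO_SIMP):
    """Separate unknown words whose simplified form is already known.

    Returns (variants, truly_unknown), each in the original order.
    """
    known_simplified = {to_simplified(w, table) for w in known_words}
    variants, truly_unknown = [], []
    for word in unknown_words:
        if to_simplified(word, table) in known_simplified:
            variants.append(word)
        else:
            truly_unknown.append(word)
    return variants, truly_unknown


if __name__ == "__main__":
    variants, rest = split_by_known_variant(["個", "學習", "盛行"], {"个", "学习"})
    print("known via variant:", variants)
    print("still unknown:", rest)
```

Folding towards simplified is the easier direction because the mapping is mostly many-to-one. Going the other way is ambiguous without context (simplified 发 corresponds to both 發 and 髮, for instance), which is the concern raised above about converting a bare word list.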

 

Would appreciate any 前车之鉴 (lessons from those who've been down this road before).

 

I really enjoy using CTA. Not sure how much demand there'd be, but if you ever thought of a Pro / Advanced version with some extra features, I'd be on board.


An advanced or pro version should count characters, too. Words are specific to a text. If you've learnt all the words in one book, say 5,000 of them, it doesn't mean another book will be based on the same words: many new ones appear while others disappear. But if you've learnt 5,000 characters, you can expect to come across unfamiliar characters only very rarely.
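A character count of this kind can be derived from any word list. The Python sketch below is one way to do it; the function name is hypothetical, and restricting the count to the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is a simplifying assumption that ignores rarer characters in the extension blocks.

```python
def distinct_characters(words):
    """Return the set of Chinese characters used in an iterable of words.

    Assumption: only the basic CJK Unified Ideographs block is counted.
    """
    return {
        ch
        for word in words
        for ch in word
        if "\u4e00" <= ch <= "\u9fff"
    }


if __name__ == "__main__":
    known_words = ["房子", "防止", "房间"]
    chars = distinct_characters(known_words)
    print(len(known_words), "words,", len(chars), "distinct characters")
```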


3 hours ago, roddy said:

Not sure how much demand there'd be, but if you ever thought of a Pro / Advanced version with some extra features, I'd be on board.

Plenty of features I'd love to add, but not enough time to add them at the expense of paying work.

 

3 hours ago, roddy said:

ie, if I mark 个 as known, I don't want 個 turning up in an unknown list, and vice versa.

These are different enough that you might not know them.  It's all very well and good to know both sets "well enough" but it's in the margins where this will make the difference.  Perhaps a feature that lets you mark simplified/traditional variants as known, but one that needs to be run manually rather than automatically.

 

3 hours ago, roddy said:

Manually add as you go along. This seems least efficient.

It also involves the most work and therefore the most learning.


@imron, I'm a bit embarrassed to admit that I've never updated CTA since I first downloaded it. My version is 0.99.9, and I now want to update to the latest version, but I can't find out how on the website. How should I do it?


Hello Imron, on macOS, CTA is not working correctly when I mark a word that breaks across two lines as known.

 

An example: The word 盛行 spans two lines (i.e., 盛 is the last character of one line; 行 is the first character of the next line). If I place my cursor on 盛 and mark the word 盛行 as known, CTA works as expected. If I place my cursor on 行 and then mark the word 盛行 as known, 盛行 remains unknown, and the character 行 becomes unknown.


Two suggestions:

  1. Keep a running corpus of all recent articles. By that, I mean save and cumulatively sum the word counts from previously opened documents, which can then be used as an alternative sort order for vocabulary in the current document. I do this in a cumbersome way now (save all articles read each week to Evernote, export to a single file, then run a single CTA pass for the most frequent unknown words in that group), but building this into CTA would be great. Make it easy to reset this local corpus frequency count, so I can reset weekly, monthly, etc. as I see fit.
  2. Timed movement of the highlight along characters. My current habit is to load a document, mark it entirely known, click the right arrow quickly (or hold it down) as I read through an article, and then hit 'd' to pop up a definition as necessary. I'll probably get some form of repetitive strain injury from all that clicking (once per word!). I'd prefer it if I could set a timer for 0.x seconds per character, with CTA then moving at that set pace (and maybe make left/right arrow increase or decrease the rate). Hitting 'd' once pauses for a definition, hitting it again moves on.
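The running corpus in the first suggestion could be sketched in Python as below. This is not how CTA works internally; the class `RunningCorpus` and its methods are invented for illustration, and the sketch assumes each document has already been segmented into a list of words, which is the part CTA does for you.

```python
from collections import Counter


class RunningCorpus:
    """Accumulates word counts across documents until reset."""

    def __init__(self):
        self.counts = Counter()

    def add_document(self, words):
        """Add one document's (already segmented) words to the running totals."""
        self.counts.update(words)

    def reset(self):
        """Start a fresh corpus, e.g. weekly or monthly."""
        self.counts.clear()

    def rank_unknown(self, document_words, known_words):
        """Unknown words of one document, most frequent in the corpus first.

        Ties are broken by the word itself so the order is stable.
        """
        unknown = set(document_words) - set(known_words)
        return sorted(unknown, key=lambda w: (-self.counts[w], w))


if __name__ == "__main__":
    corpus = RunningCorpus()
    corpus.add_document(["盛行", "的", "盛行", "防止"])
    corpus.add_document(["防止", "访问", "盛行"])
    print(corpus.rank_unknown(["访问", "防止", "盛行", "的"], known_words={"的"}))
```

Sorting by corpus-wide frequency rather than frequency in the current document is the "alternative sorting" the suggestion asks for: a word that is rare in today's article but common across the week's reading rises to the top.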

One more request:

  1. Full screen mode should be more like a "focus" mode. Right now full screen mode essentially maximizes the window, but doesn't change how it looks: menus remain, etc. Instead, it should hide all menus (top, right bar) and the status bar at the bottom, with just the body text visible and centered, similar to how an ebook looks, or reader mode in Firefox, when full screen.
