Chinese-Forums

Introducing Chinese Text Analyser


imron


8 hours ago, imron said:

Stray cats are a continual problem with unix one-liners

Lol, yes, that should be classic alt.unix.humor. It's my old, perhaps lazy, habit of thinking first "what file(s) do I want to work on?" and then "what do I want to do with it/them?"


9 hours ago, mungouk said:

some good descriptions of use-cases and tutorials

Assuming you have a halfway accurate list of your known vocab to work against, it's quite easy to identify unknown high-frequency vocab items. Learn those, or keep 'em on a handy list as you read, and then you know that anything else unknown you come across isn't going to crop up again too often.

 

You can also use it to identify vocab items that CTA (or CEDICT, rather) isn't spotting. If you see single characters like 罗 and 伯 cropping up at high frequency, there's a chance there's a 罗伯特 in your story. You can save that as a vocab item if you want - it could be a city name, a Chinese name, or any transliteration.

 

All of the lists generated can be exported for import into Pleco or elsewhere. For example, you could export the 300 most frequent unknown words in a chapter so you can do some vocab work before actually reading.
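As a rough illustration of the idea (not how CTA itself works internally), here is a minimal Python sketch. It assumes the text has already been segmented into a list of words and that your known vocab is a plain set; the function names, the sample data and the one-word-per-line export format are all assumptions made for the example.

```python
from collections import Counter


def top_unknown_words(words, known, limit=300):
    """Return (word, count) pairs for the most frequent words not in `known`."""
    counts = Counter(w for w in words if w not in known)
    return counts.most_common(limit)


def export_word_list(entries, path):
    """Write one word per line, most frequent first (assumed import format)."""
    with open(path, "w", encoding="utf-8") as f:
        for word, _count in entries:
            f.write(word + "\n")


# Hypothetical, already-segmented chapter and known-word list.
chapter = ["他", "喜欢", "魔法", "学校", "魔法", "扫帚", "学校", "魔法"]
known = {"他", "喜欢", "学校"}

result = top_unknown_words(chapter, known, limit=2)
```

Calling `export_word_list(result, "unknown.txt")` would then write the words to a text file, which you could try importing into a flashcard tool.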


On 8/14/2020 at 12:39 AM, mungouk said:

do any of you power-users feel like explaining how you use it to do things you couldn't do with other tools?

 

@mungouk, my use is by no means innovative, but what I do is:

  1. Search for different Chinese novels in txt form (or convert into it).
  2. Analyse them, see how hard they are, pick one from the "not so hard" range.
  3. Add the most frequent unknown words to my Pleco flashcards until I reach 95% comprehension (98% is better, but sometimes I'm too impatient), and study them.
  4. Read the novel without worrying too much about unknown words, only looking up those words that feel important for the plot.
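Step 3 can be sketched roughly as follows. This is a minimal Python example under stated assumptions: the novel is already segmented into a list of words, "comprehension" is read as the fraction of running words you know, and the function names and sample data are made up for illustration. CTA's own statistics may be calculated differently.

```python
from collections import Counter


def coverage(words, known):
    """Fraction of running words (tokens) that are in `known`."""
    if not words:
        return 0.0
    return sum(1 for w in words if w in known) / len(words)


def words_to_reach(words, known, target=0.95):
    """List the most frequent unknown words to learn until `target` coverage is met."""
    total = len(words)
    covered = sum(1 for w in words if w in known)
    unknown_counts = Counter(w for w in words if w not in known)
    to_learn = []
    for word, count in unknown_counts.most_common():
        if covered / total >= target:
            break
        to_learn.append(word)
        covered += count  # learning a word covers every occurrence of it
    return to_learn


# Hypothetical segmented text: 10 running words, 6 of them known.
novel = ["我"] * 6 + ["魔法"] * 3 + ["扫帚"]
known = {"我"}

start = coverage(novel, known)
study_list = words_to_reach(novel, known, target=0.85)
```

Because the most frequent unknown words are taken first, the study list stays as short as possible for a given target.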

 

I've been doing this for years; it has been my best investment in Chinese learning.


I’ve been holding back from commenting because I want to see how others use CTA, but the use case mentioned by @Geiko is more-or-less the main use case that CTA was designed for - comparing different texts to find the one most suited to your current vocabulary, and then using CTA to extract frequently occurring unknown words from that text so you can learn them (either prelearning or just learning as you go) to make reading easier.

 

I’ll contribute with further thoughts after a few more people have responded (or after a few days if no-one else does). 


What's been working fine for me:

 

1. Importing the next novel I am planning to read into CTA.

2. Extracting all words with 2 characters or more (I have a separate deck for characters) that are mentioned at least twice.

3. Importing the exported word list into Pleco, using Xiandai Guifan as a filter to remove words that are either very rare or aren't actually "real" words (to take a beginner example, learning 外國人 as a separate word is a complete waste of time). I am not using the term lemma here since that's not really the same thing. I think most get what I'm trying to say.

4. Importing the new vocab in my Anki vocab deck, then sorting them by how frequent they are in the novel.
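The filtering in steps 2 to 4 could be sketched like this in Python. It is only an approximation under assumptions: the novel is already segmented into words, the dictionary filter is modelled as a simple set of headwords (standing in for the Xiandai Guifan filter in Pleco), and the function name and sample data are hypothetical.

```python
from collections import Counter


def candidate_words(words, dictionary, min_length=2, min_count=2):
    """Keep words that are long enough, frequent enough and in the dictionary.

    Returns (word, count) pairs sorted by frequency, most frequent first.
    """
    counts = Counter(words)
    selected = [
        (word, count)
        for word, count in counts.items()
        if len(word) >= min_length and count >= min_count and word in dictionary
    ]
    selected.sort(key=lambda item: item[1], reverse=True)
    return selected


# Hypothetical segmented novel and dictionary headword set.
novel = ["外國人", "魔法", "外國人", "學校", "魔法", "魔法", "他", "他", "掃帚", "學校"]
headwords = {"魔法", "學校", "掃帚", "他"}

deck = candidate_words(novel, headwords)
```

Here 外國人 is dropped because it is not a headword, 他 because it is a single character, and 掃帚 because it only appears once.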

 

Picking words mentioned at least twice has been working well. I think 98% gives a few too many new entries for just one book, and the 98% I see thrown around is, if I'm rightly informed, mostly based on the English language? I know Paul Nation has done research that speaks about 98 percent, and that's certainly for English. If you already know almost all the characters you encounter, plus about 95 percent of the words, my experience has been that's probably comparable to those 98 percent in English, but that's just my hunch. For me it's been enough, at least for pleasure reading without looking things up. I'm actually going to experiment with decreasing that to about 90% and see if that's enough. I wouldn't be surprised if it is, but it might be pushing it a bit too far.


5 hours ago, timseb said:

the 98% I see thrown around is, if I'm rightly informed, mostly based on the English language?

 

@Olle Linge's introduction to extensive reading on hackingchinese.com also mentions 98% (and gives examples of what lower percentages would feel like, using English), but having scanned the article again I can't see the source.

 

Thanks for the input so far, guys!  I had this feeling that people are coming up with their own workflows, so it would be great to have a "cookbook" of tried and tested recipes for CTA. And Anki, and Pleco too for that matter... 


2 hours ago, mungouk said:

@Olle Linge's introduction to extensive reading on hackingchinese.com also mentions 98% (and gives examples of what lower percentages would feel like, using English), but having scanned the article again I can't see the source.

 

The source here as well seems to be from English, rather than Chinese. He refers to Marcos Benevides, who is "an assistant professor and coordinator of the reading and writing curriculum at J.F. Oberlin University’s English Language Program in Tokyo". I'm guessing he bases his assumption on the aforementioned Paul Nation, who is, I think, the most famous researcher on this. I'm struggling to see how the 98 percent could be applied to such vastly different languages; one should at least be careful in doing so, I would assume. Reading how he learned all characters in just a couple of months, I guess he means the meanings and nothing else. It should be pointed out that I'm doing all readings (and their separate common meanings) from the 通用规范汉字字典.

 

In other words, when I start my next book, I will have at least 99 percent of all the characters (possibly 100) and at least 95 percent vocab coverage. Whether or not that is enough to actually pick up new words from context I guess I can't prove, either to myself or to anyone else, but it's certainly enough to engage in pleasure reading and not look things up. There are of course a few words I *could* look up if I wanted to, but that goes for Tolkien too. Though my English is quite poor in writing since I rarely use the language, my listening ability is close to my native tongue and the reading is not that far behind, although the gap between reading my native tongue and English is larger than the gap in listening ability. Writing though, boy do I need practice.

 

Please note that I've only been doing Chinese for almost a year, so it's still trial and error. I will move on to the third Harry Potter book tomorrow and will be going for 95 percent from now on, in other words moving away from a minimum number of mentions to a percentage target. I will report back on whether it makes a difference. I have never read the Harry Potter books in any other language but am of course familiar with the universe like most people. The second book took me 20 days to finish, as I have been reading one chapter a day.

 

I guess not all of this post was directed only at you, @mungouk, but I guess posting twice would be a worse option.

2 hours ago, dougwar said:

can you guys suggest or make a list of good books for this task, or what to read? the book of the month post is a little overwhelming for me

 

Do you have an estimate at all of your vocabulary size? Others know this better than me, but I think even an easier book will have at least a few thousand words and almost 2000 characters.


Quote

Do you have an estimate at all of your vocabulary size? Others know this better than me, but I think even an easier book will have at least a few thousand words and almost 2000 characters.

I know, and that is the reason I'm asking: I'd like a big list of books to increase my vocabulary step by step, using CTA.


Just now, roddy said:

But where are you starting from? If it's a very small vocabulary, you might be better off with graded readers. Medium sized, you could tackle simpler novels.

 

Or news, but I know that's not for everyone.

 

I really think you should look into flashcard programs of any kind (Anki, Pleco etc) and learn an amount of words per day that you're comfortable with, and it's going to build up over time. Just do the maths before you pick a number... 
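One way of "doing the maths" is a quick back-of-the-envelope calculation like the Python sketch below. The function name and the numbers are purely hypothetical, and it only counts new words: it ignores the daily review load, which also grows as your deck gets bigger.

```python
import math


def days_to_reach(target_words, known_words, new_per_day):
    """Days needed to go from `known_words` to `target_words` at a fixed daily pace."""
    remaining = max(target_words - known_words, 0)
    return math.ceil(remaining / new_per_day)


# Hypothetical numbers: 1000 words known, aiming for 5000.
slow = days_to_reach(5000, 1000, 10)
fast = days_to_reach(5000, 1000, 20)
```

With these made-up numbers, 10 new words a day takes 400 days and 20 a day takes 200, so it's worth checking which pace you can actually keep up.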

 

Intensive reading is not for everyone. I never really did it, as I get tired of it very quickly. I barely read anything before my vocabulary was large enough to read a novel.


On 8/17/2020 at 8:06 AM, imron said:

I’ll contribute with further thoughts after a few more people have responded (or after a few days if no-one else does). 

Come on, I can't keep getting up early to check if you've replied...


It's a shame Chinese doesn't seem to have word families in the way English has (at least that's what I've gathered from reading a little about it, correct me if I'm wrong). If I am going to read a difficult book in English, I have noticed that scanning the book for word families keeps the number of new words down at a manageable level, even for someone like me who is not a native speaker. Studying word families in English has been way more effective than studying individual words, or lemmas, and whatnot. Once again, it's something I've picked up from Paul Nation.

 

I know Pleco does split the words into smaller fractions, but I'm guessing that is difficult to implement in CTA, since knowing *where* to split is something that's not really possible in an automated situation? I mean, if a word is split into two in Pleco, and I open one of those words, it's going to be split again into individual characters. The way I'm learning individual characters, their readings and meanings is actually inspired by Paul Nation's discussions on word families, and I do appreciate the results from it, but in the end, characters aren't really word families (perhaps stems, rather?).

 

Prelearning all words in The Two Towers via learning word families makes it entirely realistic to actually know all words in the entire, quite difficult, novel. Prelearning all the words in a Chinese book of similar length would take ages. At least that's been my experience so far.

 

Does anyone know of any lists/research at all that would make such a word family approach viable? Or are the languages just too different in their structure? In English it's mostly nouns (birds, plants, tools) I still have trouble with, but in Chinese it goes for all word classes.


I am sorry if this question has already been answered.

I just bought the license for CTA and would now like to get started.

How do I export a Pleco ebook to CTA?

So actually it's more a Pleco than a CTA question.

I do 99% of my reading on my phone, so I guess it will take me some effort to get my known words into CTA.


5 hours ago, kanumo said:

How do I export a Pleco ebook to CTA?

 

Generally, it would also be very useful to know if there's a "safe" way of converting/importing a PDF, epub etc to get it into CTA... how do we guarantee "one word per line"?

