
Introducing Chinese Text Analyser


imron

If you import all your Pleco cards, you may have to mark words that you haven't fully learnt yet as 'unknown', but typically you'd only do this when you come across them in reading and realise that you don't actually know the word.

 

You can also manually mark new words as 'known', or you can choose to 'mark words as known' when you export them from CTA, under the assumption that you're going to spend time learning them elsewhere (e.g. importing them to Pleco or some other tool) and will learn them soon enough.

 

There will be a small amount of attrition, where you come across a word previously marked as known that you now can't remember, but I've not found that to be a burden.

 

The more you use CTA, the less you should have to mark things manually. There will be an initial period where CTA is learning what you know, but once that is out of the way you shouldn't find it too much of an issue. If you do, it probably indicates that you are choosing texts too far beyond your current ability, and you should consider reading things at an easier level. CTA will also help in this regard, because it gives you a clear indication of how well you'll be able to understand an unread piece of text.


  • 1 month later...

Loving CTA: Any news on the native program for Mac?

 

I've been using the Windows version of CTA with PlayOnMac: I can confirm that it works and is usable, although it does have a number of irritating bugs. Nonetheless I have been using it frequently.

 

I usually put news articles that I have read into the same text document, and then use CTA to identify the most frequently occurring unknown words in this simple corpus. I then take example sentences containing those words and put them into Anki. Very effective. As others have also recommended, Evernote is a convenient way of storing articles that you read for putting into your corpus later.
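
The corpus workflow above can be sketched in a few lines of Python. This is only an illustration, not how CTA works internally: it assumes the text has already been segmented into words (CTA does that part), and `top_unknown_with_examples`, `sentences` and `known` are made-up names for this sketch.

```python
from collections import Counter


def top_unknown_with_examples(sentences, known, limit=10):
    """Return (word, count, example sentence) for the most frequent unknown words.

    sentences: list of sentences, each a list of already-segmented words.
    known:     set of words the reader already knows.
    """
    counts = Counter()
    example = {}
    for tokens in sentences:
        for tok in tokens:
            if tok not in known:
                counts[tok] += 1
                # Keep the first sentence in which the word appears.
                example.setdefault(tok, "".join(tokens))
    return [(word, n, example[word]) for word, n in counts.most_common(limit)]
```

Each tuple in the result pairs an unknown word with its frequency and one sentence to paste into a flashcard.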

 

After a native Mac version, it would be great to see some improvements to parsing, or at least the option for customisation.

 

Thanks again to Imron for this superb tool!


I've done about half of the remaining work necessary for the native OS X version since I last mentioned it (I've wanted to do more, but the last few months have been very busy with other stuff and I haven't had the chance).

 

Anyway, the only things missing now are the popup definitions, a search dialog, tabs for multiple documents, a couple of miscellaneous GUI things and then a few website things.  Other than that, it's more or less feature complete with the Windows version.

 

If you could happily use the program without those things, drop me a message and I'll see about getting you a test copy.

 

Regarding bugs (especially irritating ones), are they PlayOnMac specific, or CTA specific?  If the latter, please be sure to let me know so I can get them on my list of things to fix.  Likewise for examples of parsing mistakes.


Glad to hear the Mac version is coming along. Sounds like a lot of work!

 

The bugs are (I assume) PlayOnMac specific - sorry if that wasn't clear in my post above. I haven't used the Windows version since setting it up in PlayOnMac.

 

Regarding the Mac test version - will the lack of a search dialogue prevent me from finding example sentences for target words? If not, then I would be interested in a test copy, although I will likely use CTA less regularly in the next few weeks than I have recently.


Hi Imron,

 

I just had a quick question. I will be switching from a laptop to a desktop shortly. While I'm happy to pay for a new license, I'm just wondering if there's an easy way to port my known vocabulary from one version to the other.

 

Thanks in advance,

John 


For personal use, a licence can be used on any computer where the licence owner (in this case you) is the primary user (in this case also you). There's no need to buy a separate licence; your existing one will work fine (though you are always welcome to buy more copies if you wish :-) ).

 

To port your known vocabulary, simply copy:

 

c:\Users\<username>\AppData\Local\ChineseTextAnalyser\wordlists

 

to the same location on your new computer.  You may need to run CTA at least once on your new computer to make sure that this directory exists.  Then you can just completely overwrite whatever is in the wordlists directory.
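
For anyone who would rather script the copy, here is a minimal Python sketch. It assumes the old 'ChineseTextAnalyser' data directory is reachable from the new machine (for example on a USB stick); `copy_wordlists` is a made-up helper name, not part of CTA, and CTA should be closed while it runs.

```python
import shutil
from pathlib import Path


def copy_wordlists(src_root: Path, dst_root: Path) -> Path:
    """Copy src_root/wordlists over dst_root/wordlists, replacing any existing copy."""
    src = src_root / "wordlists"
    dst = dst_root / "wordlists"
    if not src.is_dir():
        raise FileNotFoundError(f"no wordlists directory at {src}")
    if dst.exists():
        # Completely replace whatever is already in the destination.
        shutil.rmtree(dst)
    shutil.copytree(src, dst)
    return dst
```

On the new computer the destination root would be `Path.home() / "AppData" / "Local" / "ChineseTextAnalyser"`; the source root is wherever the old copy of that directory ended up.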


  • 2 months later...

The installer currently requires administrator privileges; however, that will change in a future version.  The program itself doesn't need administrator privileges to run, so you can happily copy the entire 'c:\program files\chinesetextanalyser' directory to another computer and it should work just fine.  It doesn't store anything in the registry except uninstall information.  All the user data is stored in the user directory: c:\users\<your username>\AppData\Local\ChineseTextAnalyser.  You don't particularly need to copy this, as CTA will recreate everything there as needed.  See a few posts above, however, for how to port your vocabulary between computers.


It's possible, assuming you are running version 0.99.4 or later, but there's not a nice easy way to do it.

 

If you haven't exited the program yet, the easiest way is to press ctrl-shift-esc to bring up the task manager, and then kill the application (cta.exe).  The updated word list is currently only saved to disk on program exit (well, mostly, it can happen at other times also, but that's the big one), so if you force kill the application it never gets a chance to save the updates.  If that works, then great, nothing more to do.

 

If not, you'll need to do a bit of editing of text files.

 

For each session (e.g. starting the application, using it, and then quitting), CTA remembers the list of words you added and removed (hereafter referred to as a 'revision').  It mostly uses a cache that contains the union of all revisions (rather than building the word list each time), but it's also possible for it to rebuild the entire word list from scratch by starting with an empty list and then applying each revision in order.
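
To make the idea concrete, here is a small Python sketch of rebuilding a word list by replaying revisions in order. It is a model of the concept only: the `Revision` class and `rebuild` function are invented for illustration and say nothing about CTA's actual data structures or file format.

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    """Words added and removed during one session (hypothetical model)."""
    added: set = field(default_factory=set)
    removed: set = field(default_factory=set)


def rebuild(revisions):
    """Start from an empty list and apply each revision in order."""
    known = set()
    for rev in revisions:
        known |= rev.added
        known -= rev.removed
    return known
```

Replaying only the first N revisions gives the word list as it stood after session N, which is what makes going back to an earlier state possible.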

 

Each revision has an ID number, and there is a file that also keeps track of the current revision ID for the known word list.  To get back to a previous version, all you'd need to do is update that file with the ID from an older revision and everything should work.

 

The way to do this is as follows:

 

First exit CTA if it is running, and then go to c:\users\<your username>\AppData\Local\ChineseTextAnalyser and make a backup copy of the 'wordlists' folder (as a precaution).

 

Then, open the file:

 

c:\users\<your username>\AppData\Local\ChineseTextAnalyser\wordlists\lists\known

 

It's just a text file, so you can open it with a text editor such as Notepad++.  Normal Notepad won't be good enough, as it doesn't recognise the line ending that I use, so everything will appear as a single line.

 

Anyway, in it, there will be a line that says something like:

 

currentRevision = d38a1695973410666128fccda4f414994805e294

 

Your revision ID will be different, though.

 

Copy that revision id (e.g. d38a1695973410666128fccda4f414994805e294) and then open the file:

 

c:\users\<your username>\AppData\Local\ChineseTextAnalyser\wordlists\objects\known

 

(it's also just a text file), and then do a search for that id.  You should see a line that looks something like:

 

@d38a1695973410666128fccda4f414994805e294 11327c97967806a8e1860eb2c5e4cbce86ed5826 4153ad425431270c16469551de5ec0eeddd3ce60 2015-10-07 06:03:32.000

 

though once again your numbers will be different.

 

You are interested in the second sequence of numbers on that line (e.g. 11327c97967806a8e1860eb2c5e4cbce86ed5826), because that is the ID of the previous revision.

 

All you then need to do is copy that revision id and then go back to the first file and change the currentRevision line to:

 

currentRevision = 11327c97967806a8e1860eb2c5e4cbce86ed5826

 

(though obviously with the revision ids from your file, not the ones from this post).

 

Note:  Don't change anything in the objects\known file (just quit without saving).  I do integrity checking on it, so if anything changes the check will likely fail, and it will take some work to undo that.

 

Anyway, after you have changed the currentRevision in the lists\known file, save it and exit, and then run CTA and see if everything is back to normal.

 

If it's not, or if all of that seems a bit too complicated, zip up your 'wordlists' folder and send it to me in an email and I'll sort everything out.

 

In a future version, going back to previous revisions in the wordlist will all be possible from the GUI.
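
Until then, the manual steps above could be scripted. The following Python sketch relies only on the two line formats shown in this post (the `currentRevision = <id>` line and the `@<id> <previous id> ...` line). Everything else is an assumption: the function names are made up, the files are assumed to be UTF-8, and the `wordlists-backup` folder name is arbitrary. CTA must not be running when this is used.

```python
import shutil
from pathlib import Path


def previous_revision(objects_text: str, current_id: str) -> str:
    """Return the second field of the '@<current_id> ...' line."""
    for line in objects_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "@" + current_id:
            return fields[1]
    raise LookupError(f"revision {current_id} not found in objects file")


def roll_back_text(lists_text: str, objects_text: str) -> str:
    """Return lists_text with currentRevision set to the previous revision ID."""
    lines = lists_text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        key, sep, value = line.partition("=")
        current = value.strip()
        if sep and key.strip() == "currentRevision" and current:
            previous = previous_revision(objects_text, current)
            # Swap only the ID so spacing and line endings stay exactly as they were.
            lines[i] = line.replace(current, previous, 1)
            return "".join(lines)
    raise LookupError("no currentRevision line found")


def roll_back(wordlists: Path) -> None:
    """Back up the wordlists folder, then rewrite lists/known in place."""
    # Hypothetical backup name; copytree refuses to overwrite an existing backup.
    shutil.copytree(wordlists, wordlists.with_name(wordlists.name + "-backup"))
    lists_file = wordlists / "lists" / "known"
    objects_file = wordlists / "objects" / "known"
    # Assumption: both files are UTF-8. Bytes in, bytes out avoids newline translation.
    lists_text = lists_file.read_bytes().decode("utf-8")
    objects_text = objects_file.read_bytes().decode("utf-8")  # read only, never written
    lists_file.write_bytes(roll_back_text(lists_text, objects_text).encode("utf-8"))
```

Calling `roll_back(Path.home() / "AppData" / "Local" / "ChineseTextAnalyser" / "wordlists")` steps back exactly one revision; the objects\known file is only read, so the integrity checking mentioned above is not affected by the script itself.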


Good to hear.  Maybe the extra info will come in useful to someone else someday :D  Though hopefully there will be a GUI for managing all this before too long.

 

For the curious, storing revisions in this manner means that when I add graphing (also coming eventually), current users will get graphs going back to when they started using the program, and you'll also be able to do things like find out the day you learnt a certain word on, and see your average rate of learning new words over time etc.


  • 4 weeks later...

People like to say that you need 98% comprehension to make the most out of 'extensive reading'. I'm wondering how that 98% figure compares to the "Percent known" figure given by CTA.

 

I think the CTA figure would be lower than 98%: I tell the software that I know the word 德国 but naturally it won't trust that I know 德国人 too, without me saying so. I've told it I know 礼拜一, but it won't assume that I must also know 礼拜二. Lots of resultative verbs, 进去,咽下 etc are assumed unknown. There are also names of people, and mis-segmentations.

 

So basically the software -- through no fault of its own, there's no way round it -- underestimates my vocabulary. Does anyone have a sense of what "Percent known" figure from CTA tends to mean a text is at that easy-enough-for-extensive-reading level? For me I'm finding 94% indicates a fairly easy text.


Before starting to read a new novel, I study the most frequent unknown words until I reach 95% comprehension. Then I read it, and of course while reading I'll encounter new words, some of which will be added to my flashcards if I find them important, but my main focus will be simply to enjoy the reading. I've used CTA for around 20 novels / short stories so far, and I'm pretty comfortable at 95% comprehension.
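
The "study the most frequent unknown words until 95%" approach can be sketched as follows. This is a rough illustration under two assumptions: the text is already segmented into words, and the percentage is counted over running words rather than unique words (how CTA computes its own "Percent known" figure is not stated in this thread). The function names are invented for the example.

```python
from collections import Counter


def percent_known(tokens, known):
    """Percentage of running words in tokens that are in known."""
    if not tokens:
        return 100.0
    return 100.0 * sum(1 for t in tokens if t in known) / len(tokens)


def words_to_study(tokens, known, target=95.0):
    """Most frequent unknown words, in order, needed to reach the target percentage."""
    counts = Counter(t for t in tokens if t not in known)
    covered = len(tokens) - sum(counts.values())
    study = []
    for word, n in counts.most_common():
        if covered * 100 >= target * len(tokens):
            break
        study.append(word)
        covered += n
    return study
```

Because the most frequent unknown words are taken first, the returned list is the shortest one that reaches the target under this counting scheme.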


@realmayo, That's a really interesting question, and once I have the time, something I'd like to explore with CTA is how to provide users with a better understanding of how well they'll be able to read and understand a text beyond just a '98%' figure.

 

@Geiko, great to hear some hard figures, and great to hear you're finding CTA useful.  Out of curiosity, what does CTA say for your current total known vocabulary, and typically how long (or how many words) does it take for you to get up to 95% comprehension for a new text?

 

"there's no way round it"

There probably are ways to mitigate it to some degree though - e.g. the program could parse/segment the dictionary wordlist and try to find words that are composed of other words and then cross-reference that to build a list of 'probably known words' or something.
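
As a rough Python sketch of that idea (not something CTA does today): a dictionary word is flagged as 'probably known' if it can be split entirely into two or more words that are already known. The names `can_compose` and `probably_known` are made up, and being able to split a word this way obviously does not guarantee its meaning follows from the parts.

```python
def can_compose(word, known, min_parts=2):
    """True if word can be split into at least min_parts pieces that are all in known."""
    n = len(word)
    # best[i] = largest number of known pieces exactly covering word[:i], or -1 if impossible.
    best = [-1] * (n + 1)
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] >= 0 and word[start:end] in known:
                best[end] = max(best[end], best[start] + 1)
    return best[n] >= min_parts


def probably_known(dictionary, known):
    """Dictionary words not yet marked known that are built entirely from known words."""
    return {w for w in dictionary if w not in known and can_compose(w, known)}
```

With 德国 and 人 both marked as known, 德国人 would be flagged; 礼拜二 would only be flagged if 二 were known as well as 礼拜.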


I'm doing the same as you, Geiko, & it's interesting we're both comfortable at a similar %.

 

Imron, another variable to consider is that the point at which a learned word becomes an unknown word will differ between people: if I get a word wrong twice in my flashcards, I suspend it and that eventually feeds through to CTA, where the word is treated as unknown. But in context, rather than in flashcards, there's a decent chance I'll know the word, meaning that my actual comprehension is greater than the given %. On the other hand, there are bound to be some 'known' words which I end up forgetting on the day, which should lower the actual %!

 

Probably the answer is to put a variety of not-too-short texts into CTA, note the %, then go through marking as known all the 'unknown' words that I actually know, and compare the new % to the first one.

 

But no need to go to all that trouble because as Geiko says, people can just work out what CTA % translates to 'comfortable level for me to read', and use that as a guide when reading new texts.

 

Also on the 98%, where does this figure come from? Does it mean 'know 98% of the words' or 'understand 98% of the text'? If it's the latter, then a certain amount of intelligent guessing while reading will increase comprehension but can never be captured by any programme that works off a list of known words.

 

Anyway, I guess the important thing is, assuming it's not just Geiko and myself who have comfortably read texts below the 98% mark, that people using CTA for the first time aren't put off reading material that returns a % lower than that 98%.


"Also on the 98%, where does this figure come from?"

I first heard about it from the video linked to in this thread.  The video has a pretty clear example, progressing through different percentages.  It doesn't say that you need 98%, just that at 98% it becomes easy to learn new words almost exclusively from context.
