Introducing Chinese Text Analyser

May 9, 2014 at 07:57 AM

any chance of adding this in the future?

This is already added in my development version.

How are the word combinations analyzed?

Word segmentation is currently very simplistic (I plan to add better segmenters later). There is just a list of words (c:\program files\ChineseTextAnalyser\data\words.u8), and the program runs a forward longest-match algorithm against this list. You can substitute or add to this list if you want, but *Warning*: version upgrades are not guaranteed to preserve your changes, and will very likely overwrite them. I plan to add support for custom words in a later version.
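For readers unfamiliar with the technique, here is a minimal sketch of forward longest matching in Python. It is not CTA's actual code: the function name `segment` and the tiny word set are made up for illustration, and the format of words.u8 is not described in this thread, so the sketch takes an in-memory set of words instead of reading that file.

```python
def segment(text, words):
    """Greedy forward longest-match segmentation (illustrative sketch only).

    `words` is a set of known words. Characters that start no known word
    are emitted as single-character tokens.
    """
    max_len = max((len(w) for w in words), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a listed word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in words:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical word list, just for the demonstration.
demo_words = {"我", "喜欢", "中国", "中国人", "人民"}
print(segment("我喜欢中国人民", demo_words))
```

With this word set the greedy match picks 中国人 and leaves 民 stranded, rather than producing 中国 + 人民. That is the kind of error a simple longest-match segmenter makes and a better segmenter would avoid.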

However, adding pinyin and a translation (or a link to ZDIC or another online dictionary) as an optional column would do the job and make it more powerful.

These can currently be exported if needed. However I want to make people realise that looking up a word is an implicit acknowledgement that you don't know that word well enough yet. As such there will never be an option to passively see the meaning or pronunciation of a word. Looking up a word will always involve a cost (there will be an option in the next version to ask to show the definition, but this will also mark the word as unknown for a fixed duration).

The purpose of CTA is to tell you when you don't know a word well enough to read it in context, and to encourage you to export such words to another program to study further - 严师出高徒 (strict teachers produce outstanding students). Any sort of passive option bypasses that.

I usually scrape texts from the internet and would like to throw them directly into CTA.

When you scrape texts from the Internet how do you normally deal with them if not already saving to a file? I have plans at some point for a separate editor program based on the same technology, however that represents a significant development effort to do well so it won't be for a while.

May 9, 2014 at 09:58 AM

When you scrape texts from the Internet how do you normally deal with them if not already saving to a file?

Perhaps the quick and dirty solution is to add a Paste button which takes the contents of the clipboard and uses it to create a temporary file, then reads in that file?

May 9, 2014 at 10:23 AM

CTA already does this, and prompts you to save the file on closing if it was created from the clipboard.

May 9, 2014 at 02:33 PM

Thanks for responding to all my comments, Imron.
I'm glad a lot of them have made it to your list of things to do. It seems like that list is getting really really long!

I think I would most like to see the Bookmark feature implemented in a coming version. I'm sort of a slow reader right now, so I can only really read a few pages at a time in my short stories and such.

Thanks also for explaining what the Pre and Post mean. It sounds like a pretty complicated but powerful feature. It might be nice if you had some examples of how to utilize it (like the one you gave in your post) on your website or on a help document or something. I think that this would be one of those features that gets underutilized if there isn't enough guidance on how to use it.

I think your double click to mark words as known will be a good idea. I guess I'm used to keyboard shortcuts, but I see how that would be hard to implement.

I'm curious what you plan to have in your future Chinese Text Reader project? Will it be significantly different from your Text Analyzer? I'm sure it's probably years off, but I was just wondering.

I'm emailing you separately with the complete source text that I used in my export tests. It was just a copy of Water Margin that I downloaded from somewhere. Maybe you can see if you can replicate the leading period bug and the putting the cloze sentence on the second line bug that I saw in some of the test cards.

Thanks

May 9, 2014 at 07:38 PM

It seems like that list is getting really really long!

It's currently about 30 items before being ready for what I want in 1.0.0, and 60 items in total. Some of those are small things, some of them are bigger things, and the numbers are always changing as I think of new things or get new suggestions.

I guess I'm used to keyboard shortcuts,

I'm a huge keyboard user myself, only relying on the mouse when I absolutely have to. You'll note that almost every feature in CTA has a keyboard shortcut associated with it, and the astute among you may even have noticed that a handful of vi keybindings also work (j, k, n, ctrl-u, ctrl-d), with more planned in the future.

Anyway, I'll have a think to see if I can come up with a reasonable way to do this.

It might be nice if you had some examples of how to utilize it (like the one you gave in your post) on your website or on a help document or something.

I definitely plan on having a detailed manual, however that'll take a whole chunk of time away from development, especially considering the product is still in a state of flux with new features being added and changes happening to the interface. I don't want to always have to write and rewrite sections, and so I've decided to wait until a later release (1.0.0 at the latest) before putting the manual together.

I'm curious what you plan to have in your future Chinese Text Reader project? Will it be significantly different from your Text Analyzer?

It will use the same segmenter and document viewer, but will be focused more on reading texts, tracking progress over time, drills for increasing reading speeds, providing suggested content based on reading level and more. It's not years off, but definitely months.

Maybe you can see if you can replicate the leading period bug and the putting the cloze sentence on the second line bug that I saw in some of the test cards.

Thanks. I can now replicate both of these, and should have a fix in the next version.

May 13, 2014 at 09:50 PM

That does sound like a lot of work still to be done, but I think your program is very good so far, and I'm sure it will be helpful for a lot of people.

The Text Reader project sounds really useful too. I think I could benefit from all of those features. How are you planning on providing suggested content based on reading level?

May 13, 2014 at 10:50 PM

How are you planning on providing suggested content based on reading level?

Manually creating and curating content, and allowing people to search that against their known vocabulary.

May 14, 2014 at 03:52 AM

@imron, it's a shame that probably the most useful feature is one that's likely to take quite some time to get right, if ever. By which I mean text segmentation that handles things like 跟——一样. OK, I guess technically that's not a new feature, it's just an improvement on an existing feature.

But, the software is already something that I use regularly with my ebooks whenever I want to learn the vocabulary ahead of time. I just love the fact that it can slurp up an entire book into an Anki deck and have the vocabulary roughly sorted by order in which it appears. It's so nice to be able to just read simple things knowing that I've learned all the necessary vocabulary before I've even begun.

May 14, 2014 at 04:20 AM

By which I mean text segmentation that handles things like 跟——一样

Can you explain a bit more what you mean by this?

CTA can already search for grammatical patterns like this: just search for 跟*一样 or 跟…一样. A planned feature is to allow sentence mining based on search patterns. For example, you'd provide a list of search patterns, and it would parse a document and spit out the sentences (with optional cloze deletion) that match those patterns.

It might make the occasional mistake, but I've found with search it's generally pretty accurate and you can then just copy and paste the sentences.
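To make the idea concrete, here is a rough Python sketch of pattern-based sentence mining. It is only an illustration of the concept, not how CTA implements search: the function names, the sentence-splitting rule, and the `[...]` cloze marker are all assumptions, and the wildcards * and … are simply translated into a non-greedy regular expression.

```python
import re


def pattern_to_regex(pattern):
    """Turn a pattern like 跟*一样 or 跟…一样 into a compiled regex.

    Each literal part becomes its own capture group so it can be blanked
    out later for cloze deletion. Assumes at least one literal part.
    """
    parts = [p for p in re.split(r"[*…]+", pattern) if p]
    return re.compile(".*?".join("(" + re.escape(p) + ")" for p in parts))


def split_sentences(text):
    """Naive sentence splitter that keeps the closing punctuation."""
    pieces = re.findall(r"[^。！？!?\n]+[。！？!?]*", text)
    return [p.strip() for p in pieces if p.strip()]


def make_cloze(sentence, match):
    """Replace each literal part of the matched pattern with a blank."""
    out = sentence
    # Work from the last group to the first so earlier offsets stay valid.
    for g in range(match.re.groups, 0, -1):
        start, end = match.span(g)
        out = out[:start] + "[...]" + out[end:]
    return out


def mine_sentences(text, patterns, cloze=False):
    """Return the sentences of `text` matching any pattern, optionally as clozes."""
    regexes = [pattern_to_regex(p) for p in patterns]
    hits = []
    for sentence in split_sentences(text):
        for rx in regexes:
            m = rx.search(sentence)
            if m:
                hits.append(make_cloze(sentence, m) if cloze else sentence)
                break
    return hits


sample = "他跟我一样高。今天很冷。我跟你不一样！"
print(mine_sentences(sample, ["跟*一样"]))
print(mine_sentences(sample, ["跟…一样"], cloze=True))
```

Like the built-in search, a purely textual match such as this will occasionally pick up sentences where 跟 and 一样 are unrelated, so the output still wants a quick look before it goes into a flashcard deck.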

May 14, 2014 at 07:19 AM

the astute among you may even have noticed a handful of vi keybindings also work (j, k, n, ctrl-u, ctrl-d) with more planned in the future

Sweet. But do you have any plan to implement an emacs mode, so that us normal, orthodox users can type logical and elegant keybindings such as Ctrl-Alt-m (with the left hand) Shift-v p (with the right hand) M-t (with the nose) to open a file?

May 14, 2014 at 07:37 AM

At some point I'd like to add customisable key-bindings so Emacs users will be able to create their own - no Lisp interpreter however (though I have toyed with the concept of adding Lua scripting support), and neither will the program expand until it can read email.

May 14, 2014 at 11:32 PM

@imron, sorry about that, it's probably more complicated to describe than I had thought.

Being able to define structures the way that we currently can define multi-character words would be quite helpful. Basically allow us to define a word that has empty space in the middle where other words and characters are permitted.

Unless of course, I've missed that functionality, in which case never mind.

May 15, 2014 at 12:36 AM

Currently you can't define such patterns yourself and have them stored, and/or highlighted when you move the mouse over them, and/or mark them as known/unknown, and/or export sentences containing them. Currently you can only use such patterns with the Edit->Find (Ctrl-F), using * or … as wildcards.

Being able to define structures the way that we currently can define multi-character words would be quite helpful. Basically allow us to define a word that has empty space in the middle where other words and characters are permitted.

This describes the process of what you'd like to do, but not the end result. Are you able to explain what you'd like to be able to use these defined patterns for?

For example, if you'd just like to be able to extract sentences using them that's significantly easier to implement (and requires less processing time) than having the program highlight them when you hover the mouse over them and/or recognising known/unknown grammatical patterns.

If I can know what you'd like to be able to do, I can try to figure out the best way to do that.

May 18, 2014 at 05:32 PM

@imron, I've been thinking about that for the last couple of days. I thought this was going to be relatively obvious, but you are right that there are multiple possibilities depending upon what exactly I'm looking to do.

I'm mostly interested in using CTA to create flashcards so that I can read without assistance. So, even just extraction without any intelligence would make a huge difference for me. Being able to quickly create cloze sentences to work with would be of great utility, even if it's not feasible to have a program intelligently finding and marking things in the text.

It would be nice if it could take a step beyond that and recognize when one of these phrases pops up and ignore the characters involved with it, but you're absolutely correct that things like that are resource intensive and technically complex, as there are myriad structures like this that would have to be checked.

May 20, 2014 at 01:29 PM

Version 0.99.3 is now up, and includes a number of nice features including:

Popup dictionary definitions (right-click, show definition).

Double click to toggle known/unknown status

Fixes for the export problems mentioned by kikosun

and more.

For those interested, there are still a few free licences up for grabs.

May 20, 2014 at 03:39 PM

I would like to try Chinese Text Analyser. I downloaded the free trial, had a quick look and thought, hmmmm, wonder what this is for?

I have been taking part in the short story reading group, but Meng Lelan is away for a couple of months and I suggested we could continue without Meng Lelan.

I spent ages looking for something suitable and wondering how I could check it would be at the right level, and then it dawned on me: aha, that's what I could use Chinese Text Analyser for.

So I was going to apply for a free license, but my free version has run out, so I can't go to help and send feedback to apply for one.

Any suggestions?

May 20, 2014 at 09:57 PM

PM me your details. Asking people to submit through the feedback form of the program was to make sure people had actually downloaded it before asking for a licence.

May 24, 2014 at 03:40 AM

Imron offered me a free version several months ago, and I'm very impressed with the program and have incorporated it into my daily Chinese reading. It's difficult for me to imagine going back to the hodgepodge method I had before of trying to figure out which words were most important to add to my flashcards.

Part of my job entails daily reading of news and policy documents in Chinese, and while I was adequate with this before, CTA makes the process of incorporating new vocab into my flashcards much more logical, and easier to boot. Also, I find reading the text in CTA better than on Chinese websites, which are typically quite cluttered and distracting. I use it less for books, which I read less, but my impression is that it's perhaps even more useful there (see suggestion below). realmayo already provided a pretty good account which is similar to how I use the program, so for those interested I'd suggest reading that post. Instead I'll provide some suggestions on features I'd like to see:

Persistent user corpus - currently the program checks for frequency within a given document - this is great, especially when scanning books or longer articles. On the other hand, I tend to read a lot of shorter news articles, so I feel like CTA is missing much of the overlap from article to article, which are all on the same theme/topic. I'd like to see the option of sorting words by "frequency within this text" and "frequency within all texts I've scanned" - perhaps further customizable for texts scanned in the last X weeks or Y万字.
Scanning of just a section of text - this would be useful with large documents. For example, in a current book I just want to scan the first part of chapter 1 (第一篇贡品　1). Would be great if I had a way to select all text between the markers 第一篇贡品　1 and 第一篇贡品　2 and just scan that, without having to copy and paste in a separate document first. Also, if CTA could recognize some common chapter markers, such as those above, and split automatically that would be very convenient.
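Until something like that exists in the program, the section-scanning idea can be approximated outside CTA with a few lines of Python. This is only a sketch of the suggestion above, not a CTA feature: the function names and the heading regex are assumptions, and the regex only covers headings that start a line with 第 + number + 篇/章/回/节/卷.

```python
import re

# Assumed heading shape: a line beginning with 第<number><篇|章|回|节|卷>.
CHAPTER_HEADING = re.compile(
    r"^第[一二三四五六七八九十百千零〇0-9]+[篇章回节卷].*$", re.MULTILINE
)


def extract_between(text, start_marker, end_marker):
    """Return the text from start_marker up to (not including) end_marker.

    Returns None if start_marker is missing; runs to the end of the text
    if end_marker is missing.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        end = len(text)
    return text[start:end]


def split_chapters(text):
    """Split text into chunks, each starting at a recognised chapter heading."""
    starts = [m.start() for m in CHAPTER_HEADING.finditer(text)]
    if not starts:
        return [text]
    bounds = starts + [len(text)]
    chunks = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    if starts[0] > 0:
        # Keep any front matter that precedes the first heading.
        chunks.insert(0, text[:starts[0]])
    return chunks


book = "序言\n第一篇贡品　1\n甲乙丙\n第一篇贡品　2\n丁戊\n第二章\n己庚\n"
print(extract_between(book, "第一篇贡品　1", "第一篇贡品　2"))
print(len(split_chapters(book)))
```

Each extracted chunk could then be saved to its own file and opened in CTA, which is the copy-and-paste workaround done in bulk.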

Overall very happy with the program, even in its current "pre-release" state - it already serves the main utility required to warrant the cost. Given the impressive progress since I started using it, I'm excited to see what other features imron has in store for it! Highly recommend the program to anyone that is trying to add structure to their learning from texts.

May 24, 2014 at 11:47 AM

Persistent user corpus -- something like that would be lovely. I achieve similar via a spreadsheet but it's clunky & fiddly.
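For anyone wanting something less fiddly than a spreadsheet in the meantime, here is a small Python sketch of the bookkeeping involved. It is a hypothetical helper, not part of CTA: the class name `UserCorpus`, the optional JSON file, and the assumption that each text arrives as an already segmented list of words (for example from a CTA export) are all illustrative.

```python
import json
from collections import Counter
from pathlib import Path


class UserCorpus:
    """Running word counts across every text added so far (sketch only)."""

    def __init__(self, path=None):
        # If a path is given, totals are loaded from and saved to that JSON file.
        self.path = Path(path) if path else None
        self.totals = Counter()
        if self.path and self.path.exists():
            self.totals.update(json.loads(self.path.read_text(encoding="utf-8")))

    def add_text(self, words):
        """Count one segmented text, fold it into the totals, return its counts."""
        counts = Counter(words)
        self.totals.update(counts)
        if self.path:
            self.path.write_text(
                json.dumps(self.totals, ensure_ascii=False), encoding="utf-8"
            )
        return counts

    def rank(self, counts, by="text"):
        """Sort a text's words by frequency in that text or in the whole corpus."""
        source = counts if by == "text" else self.totals
        return sorted(counts, key=lambda w: (-source[w], w))


corpus = UserCorpus()  # pass a file name here to keep totals between sessions
corpus.add_text(["经济", "政策", "经济", "改革"])
article = corpus.add_text(["政策", "政策", "改革", "市场"])
print(corpus.rank(article, by="text"))
print(corpus.rank(article, by="corpus"))
```

In the second article 改革 and 市场 each appear once, but 改革 ranks higher under the corpus ordering because it also appeared in the first article, which is the article-to-article overlap described above.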

May 25, 2014 at 03:06 AM

Excellent suggestions, thanks. Some are already on my todo list, and I'll add the others too.

Posters, in thread order: imron, character, imron, kikosun, imron, kikosun, imron, hedwards, imron, laurenth, imron, hedwards, imron, hedwards, imron, Shelley, imron, icebear, realmayo, imron