Chinese-Forums
Introducing Chinese Text Analyser


imron

It looks interesting, Imron. What about putting some screenshots on your site? Many people (me included) like to see what a piece of software looks like before they purchase it or download the trial version, to get a rough idea of its functionality and its look and feel.


Yeah, that's on my list of things to do :mrgreen:

 

In the meantime, attached is a screenshot of the program with the script for 武林外传 loaded against the HSK-6 word list.

 

post-462-0-42266100-1397029592_thumb.png

 

And also the dialog for exporting vocabulary from a given document

 

post-462-0-89835300-1397030515_thumb.png


Since I read pretty much everything on paper, this would be the kind of tool I'd never use, but it sounds so extremely useful that in this case I might actually use it.


I've been using this since the weekend and although I don't want to 拍 any 马derator's 屁, I think it's fantastic (I have no idea how it compares to other similar programmes out there). Each time I get to a new chapter in the books I'm reading, I paste that text into the programme -- before reading -- and then export all the unknown words. I skim through them and try to 'learn' a certain number of those I think most worthwhile. I particularly like how the programme will also export the actual sentence that each unknown word occurred in. I like to think that when I then start reading the chapter and come across those words, they are learned better than if it was the other way around (i.e. first read, then take out unknown words, then learn them).

 

I've also started doing the same with news: collect text from several articles on the same topic, paste it all into the programme, and find the most frequent unknown words across those texts, and skim or learn them before starting reading.
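For anyone curious what "most frequent unknown words across several texts" amounts to, here is a minimal Python sketch. It is not how Chinese Text Analyser works internally; it assumes the articles are already segmented into word lists, and the function name `frequent_unknown_words`, the toy articles, and the known-word set are all made up for illustration.

```python
from collections import Counter

def frequent_unknown_words(segmented_texts, known_words, top_n=10):
    """Count words that are not in known_words across all texts.

    segmented_texts: iterable of word lists (one list per article).
    Returns (word, count) pairs, most frequent first.
    """
    counts = Counter(
        word
        for text in segmented_texts
        for word in text
        if word not in known_words
    )
    return counts.most_common(top_n)

# Toy, hand-segmented "articles" on the same topic (illustrative data only).
articles = [
    ["我", "喜欢", "经济", "新闻"],
    ["经济", "危机", "影响", "我"],
    ["新闻", "报道", "经济", "危机"],
]
known = {"我", "喜欢", "新闻"}

top = frequent_unknown_words(articles, known, top_n=2)
print(top)
```

Words that recur across articles float to the top, which is why pooling several texts on one topic before studying tends to surface the vocabulary worth learning first.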

 

I'm on a reading & vocab binge at the moment so it's been perfect timing.

 

Lu, there's probably a decent chance what you're reading is online too, which would let you copy/paste before returning to the paper book.


This looks promising, I may well buy a copy. I'm wondering though how similar is this to lwt.sourceforge.net and http://www.mandarintools.com/dimsum.html? And does it offer a way of automatically looking up the definitions and romanization for the words?

 

You might want to provide a list of formats as I could see this being a tool to work along with LWT and Anki. If the segmenting is any good, this would make my life a lot easier.

 

Also, I really appreciate that the pricing page warns about the possibility of a foreign transaction fee. It's nice to have that warning for those of us who can choose a credit card that doesn't charge the fee.


I used it a bit today, and overall it looks quite nice.

 

But, a couple of things would greatly improve things. I appreciate the options for the word, but it would be nice to have one which is the pinyin and the character together. When I do my first run through on vocab, I like to do it like that. Then on later runs I'll match the pinyin to the character.

 

A preview of what the export is going to look like would also be very helpful. IMHO, it doesn't matter whether it's a generic example, or one that's taken from the actual output, but it would make it easier when exporting.

 

For whatever reason there's a bunch of boxes in the output. They don't seem to be causing any trouble, so I'm not really worried about it, but it could be confusing for people using the program.

 

Overall, though, it looks very nice and I'm likely to buy a copy, as it appears that it will save me a huge amount of time.


I'm wondering though how similar is this to lwt.sourceforge.net and http://www.mandarint...om/dimsum.html

 

They are similar in quite a few ways, in that they are all designed to segment text and help you learn vocab, but different in others. Probably one of the main differentiators is performance. Chinese Text Analyser was written from the ground up to handle large amounts of text in a short amount of time. Neither of the above tools works well with anything other than short articles, and both struggle with novel-length texts (though DimSum is significantly better than LWT in this regard). Chinese Text Analyser handles megabytes of text with ease, and gigabytes of text without raising a sweat. I would encourage you to do a side-by-side comparison of segmentation and statistics-gathering time for each of the programs with something like the full script for 武林外传, or the full text of a novel of your choice, to get a full appreciation of this.

 

Next is ease of use. LWT is a pain to set up and install on your local machine (some might even say a nightmare), and if you don't run it locally you'll have resource constraints on whatever server you are running it on, making it even more impractical for long texts. DimSum is much better than this but still depends on a Java runtime. Case in point: I just downloaded the latest release, but I then needed to download and update the Java runtime before it would work. Ugh. By comparison, Chinese Text Analyser has no external dependencies and takes two clicks to install once downloaded.

 

In terms of total features, LWT and DimSum are still ahead of Chinese Text Analyser, but what Chinese Text Analyser does, it does well, and I'll be continuing to add features as time goes by.

 

And does it offer a way of automatically looking up the definitions and romanization for the words?

 

It does not have an automatic way of looking up definitions and romanization of words. This is intentional: I believe automatic lookup encourages bad habits, is detrimental to long-term learning, and lets you fool yourself into thinking you understand things when you do not. As such, Chinese Text Analyser will happily export pinyin and English definitions for words in a document, but it doesn't provide automatic lookup, because I want to discourage the user from continually looking up words while reading. Instead, they should be actively marking words as unknown and then exporting lists of those words for use in other programs such as an SRS tool. I do plan to add a feature in the future that allows manual lookups (e.g. by right-clicking on the word or similar) but then automatically marks those words as unknown for a fixed period of time.

 

 

You might want to provide a list of formats

 

At the moment it reads UTF-8, UTF-16 or GB*.  Adding support for other encodings is trivial if there is a demand.  Output is always UTF-8 text.
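For readers who want to pre-convert files themselves, below is a rough Python sketch of reading those three encoding families and always emitting UTF-8. It is only one plausible approach, not the program's actual detection logic (which isn't described here). The function name `decode_chinese_text` is hypothetical, and the fallback order (BOM check, then UTF-8, then GB18030 as a superset of GB2312/GBK) is an assumption.

```python
import codecs

def decode_chinese_text(raw: bytes) -> str:
    """Decode bytes that are UTF-8, UTF-16 (with BOM), or a GB* encoding."""
    # A byte-order mark is the most reliable signal, so check it first.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order.
        return raw.decode("utf-16")
    try:
        # Strict UTF-8 rejects almost all GB-encoded Chinese text.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # GB18030 is a superset of GB2312 and GBK.
        return raw.decode("gb18030")

def to_utf8(raw: bytes) -> bytes:
    """Normalise any supported input to UTF-8 output."""
    return decode_chinese_text(raw).encode("utf-8")

sample = "武林外传"
for encoding in ("utf-8", "utf-16", "gb18030"):
    assert decode_chinese_text(sample.encode(encoding)) == sample
```

UTF-16 without a BOM is not handled in this sketch; it would need a heuristic or an explicit user choice.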

 

 

If the segmenting is any good, this would make my life a lot easier.

 

The segmenting at the moment is passable, but not as good as I want it.  I have a number of ideas for improvements and have designed the program in such a way that I can easily add or replace new segmenters (including for other languages), but I wanted to get the product out first before returning my focus to improving the segmentation.
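To illustrate what "easily add or replace segmenters" can look like, here is a small Python sketch in which a segmenter is simply a function from text to a list of words. This is an assumption-laden toy, not the program's design or algorithm: the greedy forward maximum matching shown is a common baseline technique, and `make_max_match_segmenter`, `per_character`, `count_words`, and the tiny dictionary are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Set

# Any function with this shape can be plugged in as a segmenter.
Segmenter = Callable[[str], List[str]]

def make_max_match_segmenter(dictionary: Set[str]) -> Segmenter:
    """Build a greedy forward maximum-matching segmenter."""
    max_len = max((len(word) for word in dictionary), default=1)

    def segment(text: str) -> List[str]:
        words = []
        i = 0
        while i < len(text):
            # Try the longest candidate first; a single character
            # is always accepted as the fallback.
            for size in range(min(max_len, len(text) - i), 0, -1):
                candidate = text[i:i + size]
                if size == 1 or candidate in dictionary:
                    words.append(candidate)
                    i += size
                    break
        return words

    return segment

def per_character(text: str) -> List[str]:
    """Trivial alternative segmenter: every character is a word."""
    return list(text)

def count_words(text: str, segmenter: Segmenter) -> Counter:
    """Statistics code depends only on the Segmenter interface."""
    return Counter(segmenter(text))

max_match = make_max_match_segmenter({"我们", "喜欢", "学习", "中文"})
print(max_match("我们都喜欢中文"))
print(count_words("我们喜欢中文", per_character))
```

Because `count_words` only sees the `Segmenter` signature, swapping in a better algorithm (or one for another language) does not touch the rest of the code.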

 

I appreciate the options for the word, but it would be nice to have one which is the pinyin and the character together. When I do my first run through on vocab, I like to do it like that.

Can you explain this in a bit more detail? When you say 'options for the word', what do you mean? You should be able to export pinyin alongside the characters if needed.

 

A preview of what the export is going to look like would also be very helpful.

Already on my list of things to do :mrgreen:

 

For whatever reason there's a bunch of boxes in the output.

Can you send me a screen shot, and also a sample of the text that has the problem so I can figure out what is causing it?


Thanks for the response. I totally understand the lack of automation in parts of it. I've personally got mixed feelings about it, but I definitely see merit in that.

 

I've attached a screen shot there. I should probably mention that I'm using Crossover to run the application, so that might be the source of the boxes. As far as I can tell, it's how the tabs are being reflected, but I suppose it could be something else. But, since it doesn't show up in the exports, I wouldn't consider it a high priority.

 

I've found flash cards for Chinese to be a bit of a pain because there are different words that correspond to the same pinyin. There are 3 different cards to do what in most languages would be just one. So, if I'm learning 这个, I'd generally have a card that has 这个 - zhè ge on the front and the definition on the back. Then I'll generally create another card that has just 这个 on the front and the definition on the back.

 

If that's a dumb way of doing it, feel free to say so and not implement it, but because of the pinyin not being unique to a given word, any cards that involve pinyin are going to need something to make them unique in a way that's useful to the learner.
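To make the card layout concrete, here is a rough Python sketch that turns exported (word, pinyin, definition) rows into the two cards described above, written as tab-separated text of the kind SRS tools such as Anki can import. The function names and the sample entry are made up for illustration; nothing here reflects an existing export option in the program.

```python
import csv
import io

def make_cards(word, pinyin, definition):
    """Return (front, back) pairs: character+pinyin first, then character only."""
    return [
        (f"{word} - {pinyin}", definition),  # first pass: pinyin shown with the word
        (word, definition),                  # later pass: characters only
    ]

def cards_to_tsv(entries):
    """Serialise cards for all entries as tab-separated front/back lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for word, pinyin, definition in entries:
        writer.writerows(make_cards(word, pinyin, definition))
    return buffer.getvalue()

# Illustrative entry only.
tsv = cards_to_tsv([("这个", "zhè ge", "this; this one")])
print(tsv)
```

Keeping the characters on every front is what makes each card unique even when two words share the same pinyin.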

 

post-50170-0-90919400-1397090227_thumb.jpg

