Chinese-forums.com
Learn Chinese in China

imron

Introducing Chinese Text Analyser


imron
6 hours ago, mungouk said:

 

Generally, it would also be very useful to know if there's a "safe" way of converting/importing a PDF, epub etc to get it into CTA

The safest way is to open it in another application, then copy to the clipboard, and then paste it in to CTA. 
 

The one word per line importing is for importing wordlists.

icebear

Bug: if a multi-character word breaks across two lines, hovering the mouse over the fragment on the second line and pressing a hotkey (space to mark known/unknown) does nothing. Hovering over the fragment on the first line and pressing the hotkey works as expected.

 

Additionally: with a 4-character word (错综复杂), pressing the 'd' hotkey while hovering over the second-line fragment reports that no definition is available. Hovering over the first-line fragment and pressing the hotkey shows the correct definition.

imron

This is a known bug on my list of things to do.

roddy

Quick question possibly already asked - if simplified is fed in, and traditional exported - I'm assuming that's info from CEDICT and as such has been looked at by a human, rather than automated? 

imron

It's info from CEDICT, and maps to the CEDICT entry.  You might want to do a test exporting a single character like 发 which has multiple traditional characters to see what it exports when there are multiple entries.  From memory you should get one exported entry for each different character.
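For anyone who wants to check this outside CTA, here is a minimal sketch that collects the traditional forms listed for a simplified headword in CC-CEDICT-style lines (`Traditional Simplified [pinyin] /gloss/`). The function name and the sample lines (with shortened glosses) are only illustrative, and this is not how CTA itself performs the export.

```python
import re
from collections import defaultdict

# CC-CEDICT line format: Traditional Simplified [pinyin] /gloss 1/gloss 2/
CEDICT_LINE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/\s*$")


def traditional_forms(lines):
    """Map each simplified headword to its traditional forms, in file order."""
    forms = defaultdict(list)
    for line in lines:
        if line.startswith("#"):  # comment lines in the dictionary file
            continue
        match = CEDICT_LINE.match(line)
        if not match:
            continue
        traditional, simplified = match.group(1), match.group(2)
        if traditional not in forms[simplified]:
            forms[simplified].append(traditional)
    return dict(forms)


# Illustrative sample lines (glosses shortened), not a real dictionary file.
SAMPLE = [
    "# CC-CEDICT sample",
    "發 发 [fa1] /to send out/to issue/",
    "髮 发 [fa4] /hair/",
    "錯綜複雜 错综复杂 [cuo4 zong1 fu4 za2] /tangled and complicated/",
]

mapping = traditional_forms(SAMPLE)
print(mapping["发"])
```

With the sample above, 发 maps to two traditional forms, which is the case worth testing in an export.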

icebear
On 8/13/2020 at 2:39 PM, mungouk said:

I think what I'm missing is some good descriptions of use-cases and tutorials to show what it's capable of, and how I could be using it.

 

Are there any examples out there already on, say, youtube? 

If not, do any of you power-users feel like explaining how you use it to do things you couldn't do with other tools?

 

My goal is to understand what words I'm running into frequently across many texts, but maybe only appear a few times in a single text. I primarily read short to medium length Chinese articles (500-5,000 characters) on geopolitics and economics, both for personal interest and work, so a personal corpus is very useful; if I was mostly reading long books it may not be as important a step.

 

Reading:

  1. Find one article of interest
  2. Copy text into CTA to read, and into a personal corpus*
  3. CTA: mark entire document known, then read through and use mouseover + hotkey "d" to show the definition of any word I'm not certain of. I've requested an automatic "read along" feature that will reduce the need for a mouse during this part, and try to train faster reading (following along), but even as is this is pretty good.
  4. Find a new article of interest

Once weekly:

  1. Load entire corpus of Chinese articles into CTA
  2. Review top ~x (usually 30-50) words by frequency within corpus, mark as known any that I have high confidence around
  3. Copy remaining most frequent unknown words into Pleco flashcards

The corpus: for a long time I used Evernote Web Clipper to save articles to a "Chinese" folder, then used Evernote desktop once a week to export that folder as a single text file. More recently I've been using a Firefox add-on called "MarkDownload" that saves the body of a web page as a nearly plain-text .md file, and I merge those files into a single file later. Whatever the method, you need a single file with all the articles you have read in one place so that it is easy to copy into CTA.
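The merge step can be scripted. Below is a minimal sketch, assuming the saved articles are UTF-8 `.md` files in one folder. The function name, folder name and file names are hypothetical, and the demo writes to a temporary directory so it can be run as-is.

```python
import tempfile
from pathlib import Path


def merge_articles(folder, output_file, pattern="*.md"):
    """Concatenate every saved article in `folder` into one UTF-8 text file."""
    folder = Path(folder)
    output_file = Path(output_file)
    # Skip the output file itself in case it lives in the same folder.
    articles = sorted(
        p for p in folder.glob(pattern) if p.resolve() != output_file.resolve()
    )
    with output_file.open("w", encoding="utf-8") as out:
        for path in articles:
            out.write(path.read_text(encoding="utf-8").strip())
            out.write("\n\n")  # blank line between articles
    return len(articles)


# Demo with two tiny made-up articles in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    clips = Path(tmp) / "clips"
    clips.mkdir()
    (clips / "a.md").write_text("第一篇文章。", encoding="utf-8")
    (clips / "b.md").write_text("第二篇文章。", encoding="utf-8")
    corpus = Path(tmp) / "corpus.txt"
    merged_count = merge_articles(clips, corpus)
    corpus_text = corpus.read_text(encoding="utf-8")

print(merged_count)
```

The resulting single file can then be opened and copied into CTA for the weekly review.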

 

You can reset this corpus occasionally, although I'm of two minds on whether that's useful. On the one hand, I don't want to be studying words from very old-to-me texts that suddenly emerge as top frequency; on the other, my list of known and unknown words should be as current as the most recent article I read (since I mark all words known, and then show the definition of, or mark unknown, those I have trouble with as I go through). In principle this means that in my weekly review I should only see words I recently had trouble reading. But either point of view can be argued, and I do vacillate between them.

 

I've also requested a CTA feature that would mimic this corpus process (keeping a local word list with count of occurrences across all loaded documents in the past), but again this is a pretty easy workflow to maintain.

roddy

I wonder if overall frequency info would be useful - not within the text in question, but in whatever corpora you take the information from. That'd help judgement calls on what's worth paying attention to - "Ok, it only appears twice in this text, but it looks to be relatively common in the language, so...." or "Comes up a lot here, but looks pretty obscure in general"

Jan Finster
On 5/25/2014 at 2:14 PM, fabiothebest said:

I read that the program allows you to "Export word lists of known or unknown words for use in SRS or other programs". I'd be interested in creating wordlists for Pleco. Is it easy to do?

 

@Imron: I cannot figure out how to export my list of known words (under menu "word lists") at all. 🥺 When I open menu --> export, all 3 options are greyed out (unclickable). I am talking about my "reference list of known words" (under menu "word lists"), not a word list generated after CTA analyses a random text. I would like to review my list of known words to see HSK levels, character numbers etc.

When I go to menu "word lists" --> my word list --> edit, I can see all the words in the small window and select them, but I cannot copy them (to paste them elsewhere)... (?)

imron

There is not currently an easy way from within the app itself to export all your known words (the export function works on the current document).

 

However there is an easy way to get at the list of known words, because they are all cached in files in the following directories:

 

Windows:

C:\Users\[username]\AppData\Local\ChineseTextAnalyser\wordlists\cache

 

macOS:

~/Library/Application Support/ChineseTextAnalyser/wordlists/cache/

 

Linux:

~/.local/share/ChineseTextAnalyser/wordlists/cache

 

That directory will contain one file per wordlist, with each file containing one word per line.

 

Note: Don't make changes to these files, as this is not how CTA stores the wordlists internally. If CTA detects that the cached files have been changed, it will just overwrite them based on its internal copy of the wordlist.
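As a rough sketch of how the cached lists could be read (read-only) from a script: the directory locations are the ones listed above, while the function names, the UTF-8 encoding and the use of the `LOCALAPPDATA` environment variable on Windows are assumptions on my part.

```python
import os
import sys
from pathlib import Path


def cache_dir():
    """Return the wordlist cache directory listed above for this platform."""
    if sys.platform.startswith("win"):
        # Assumption: LOCALAPPDATA points at C:\Users\[username]\AppData\Local
        base = Path(os.environ.get("LOCALAPPDATA", Path.home() / "AppData" / "Local"))
    elif sys.platform == "darwin":
        base = Path.home() / "Library" / "Application Support"
    else:
        base = Path.home() / ".local" / "share"
    return base / "ChineseTextAnalyser" / "wordlists" / "cache"


def read_wordlists(directory):
    """Read each cached file (one word per line) into {file name: [words]}."""
    wordlists = {}
    directory = Path(directory)
    if not directory.is_dir():
        return wordlists
    for path in sorted(directory.iterdir()):
        if path.is_file():
            # Assumption: the files are UTF-8 (a BOM, if present, is dropped).
            text = path.read_text(encoding="utf-8-sig")
            wordlists[path.name] = [w.strip() for w in text.splitlines() if w.strip()]
    return wordlists


# Only reads the files; nothing is written back to the cache.
for name, words in read_wordlists(cache_dir()).items():
    print(name, len(words))
```

The lists can then be copied elsewhere or compared against HSK lists without touching the cache itself.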

Borkie
On 9/7/2020 at 7:00 AM, icebear said:

I primarily read short to medium length Chinese articles (500-5,000 characters) on geopolitics and economics, both for personal interest and work, so a personal corpus is very useful

 

Would you be willing to share this corpus? I'm studying international relations and it'd be great to run it through CTA and find some geopolitics related vocab to learn. 

laurenth

Hello, is there a way I can force CTA to work on a character-by-character basis (no parsing) for classical Chinese?

 

[Edit] Oops, my question has an answer, sort of, on p. 32. Sorry.

imron

There’s not really a way to currently do this. 

tiantian

Hi imron,

 

Is there a way to open a pre-segmented txt file (e.g. a book neatly segmented with Stanford CoreNLP) so that CTA uses that segmentation (skipping its own segmentation pass)?

imron

Not currently unfortunately. 

jannesan
On 1/21/2021 at 10:51 AM, tiantian said:

Is there a way to open a pre-segmented txt file (e.g. a book neatly segmented with Stanford CoreNLP) so that CTA uses that segmentation (skipping its own segmentation pass)?

 

If you managed to use that library to segment the text yourself, I bet you can also write a script that does the analysis of the segmented text:)

I guess this is even possible with Excel, but I don't have any clue about that. If you can't figure it out, send me a message and I could write something basic for you.
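Something basic along these lines could look like the sketch below. It assumes the segmenter's output separates words with spaces, treats any token containing a CJK character as a word, and takes the known words as a plain set. The function name and sample sentence are made up for illustration.

```python
from collections import Counter


def word_frequencies(segmented_text, known_words=()):
    """Count space-separated tokens containing a CJK character.

    Returns (all counts, counts for words not in known_words).
    """
    known = set(known_words)
    counts = Counter(
        token
        for token in segmented_text.split()
        # Keep tokens with at least one CJK ideograph; drops bare punctuation.
        if any("\u4e00" <= ch <= "\u9fff" for ch in token)
    )
    unknown = Counter({w: n for w, n in counts.items() if w not in known})
    return counts, unknown


sample = "我 喜欢 阅读 地缘政治 文章 ， 地缘政治 很 错综复杂 。"
counts, unknown = word_frequencies(
    sample, known_words={"我", "喜欢", "很", "文章", "阅读"}
)
print(unknown.most_common())
```

Here the unknown words come out sorted by frequency, which is roughly the view you would otherwise get from CTA's word list.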

realmayo
On 1/21/2021 at 9:51 AM, tiantian said:

Is there a way to open a pre-segmented txt file (e.g. a book neatly segmented with Stanford CoreNLP) so that CTA uses that segmentation (skipping its own segmentation pass)?

 

Surely if text has already been segmented - with spaces between words - then CTA will respect those words as words, except where it doesn't have the words in its database (and therefore breaks them down into individual characters)?

tiantian
Quote

Surely if text has already been segmented - with spaces between words - then CTA will respect those words as words,

Oh, I'll have to try this then!

tiantian

Something different:

Based on @imron's provided lua scripts, I wrote a script "example sentence extractor" that does the following:

 

It asks for a corpus txt file of Chinese text.

It asks for a wordlist txt file of your unknown words (one line per word).

 

It outputs a list of example sentences for each unknown word, with the unknown word marked in brackets. Only sentences in which you know at least 80% of the words are selected (the threshold is adjustable in the script). This keeps the example sentences easy enough that you can concentrate on the unknown word.

 

I am an advocate of learning new words with example sentences, and this is just an easy way to create some for the words you want to learn, made up mostly of your known vocabulary.

 

I use a corpus of around 1000 books.

 

I am no coder, so the question I have is: Is there a way to make this script faster? Obviously, the bigger a corpus you use, the more sentences you get. But it will take hours to run the script. Too many loops I guess.

Currently, it loops through the list of unknown words then through all the sentences that CTA found, then through all the words of the sentence.
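One common way to cut this kind of nested loop down is to go through the sentences only once and use set lookups to find which unknown words each sentence contains, so the cost depends on the size of the corpus rather than on corpus size times the number of unknown words. The sketch below shows the idea in Python rather than Lua. It does not use CTA's scripting API, all names are hypothetical, and counting the target word itself toward the 80% is an assumption about how the original script works.

```python
from collections import defaultdict


def example_sentences(sentences, known_words, unknown_words, min_known=0.8):
    """Single pass over pre-segmented sentences (each a list of words).

    Returns {unknown word: [sentence with that word in brackets, ...]}.
    """
    known = set(known_words)
    targets = set(unknown_words)
    examples = defaultdict(list)
    for words in sentences:
        if not words:
            continue
        # Set intersection finds every target word in this sentence at once.
        hits = targets.intersection(words)
        if not hits:
            continue
        # The known ratio is computed once per sentence, not once per target.
        known_ratio = sum(w in known for w in words) / len(words)
        if known_ratio < min_known:
            continue
        for target in hits:
            marked = "".join(f"[{w}]" if w == target else w for w in words)
            examples[target].append(marked)
    return dict(examples)


sentences = [
    ["我", "喜欢", "看", "地缘政治", "文章"],
    ["局势", "错综复杂"],
]
result = example_sentences(
    sentences,
    known_words={"我", "喜欢", "看", "文章"},
    unknown_words={"地缘政治", "错综复杂", "局势"},
)
print(result)
```

In the sample, the first sentence passes (4 of 5 words known) and the second is skipped because none of its words are known.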

 

I share the script here because I find it useful and perhaps somebody might have an idea to make it faster (I kind of just pasted stuff together from imron's scripts so most credits go to him).

 

P.S.: @imron Do you still plan to integrate a corpus feature of this sort natively?

examplesentences.lua

imron
8 hours ago, tiantian said:

Do you still plan to integrate a corpus feature of this sort natively?

Yes, but I haven’t had the time to work on it. 
 

I’ll take a look at your script a little later. 
