Chinese-forums.com

Introducing Chinese Text Analyser

Jan Finster
9 hours ago, tiantian said:

Something different:

Based on @imron's provided lua scripts, I wrote a script "example sentence extractor" that does the following:

 

It asks for a corpus txt file of Chinese text.

It asks for a wordlist txt file of your unknown words (one line per word).

 

It outputs a list of example sentences for each unknown word, with the unknown word marked in brackets. Only sentences in which you know 80% of the words are selected (the threshold is adjustable in the script). This keeps the example sentences easy enough that you can concentrate on the unknown word.

 

I am an advocate of learning new words through example sentences, and this is just an easy way to create them for the words you want to learn, built mostly from your known vocabulary.

 

I use a corpus of around 1000 books.

 

I am no coder, so my question is: is there a way to make this script faster? Obviously, the bigger the corpus, the more sentences you get, but the script takes hours to run. Too many loops, I guess.

 

This is a great idea. Thanks! 😊

 

I did this manually and googled/baidued for sample sentences: https://www.chinese-forums.com/forums/topic/60185-learning-vocabulary-by-context-mining/

 

I tried your script with a more modest list (300 words) and just one reference book as the corpus. It took 2-3 minutes or so, which is fine by me.

I would however suggest:

a) skip non-matches rather than listing them with empty results

b) maybe add statistics on how many results were found or not found

c) I did not like having the unknown word marked in brackets, but I am not sure there is a better way to highlight it. Deleting the brackets with Notepad++ is not a big issue, but it would be yet another step...

 

Thanks for sharing!!!

 

 

For the coders here: if someone could turn this idea into an app that mines sample sentences from the internet (Google, Baidu, etc.), that would be even more amazing.


tiantian
Quote

I tried your script with a more modest list (300 words) and just one reference book as the corpus. It took 2-3 minutes or so, which is fine by me.

I would however suggest:

a) skip non-matches rather than listing them with empty results

b) maybe add statistics on how many results were found or not found

c) I did not like having the unknown word marked in brackets, but I am not sure there is a better way to highlight it. Deleting the brackets with Notepad++ is not a big issue, but it would be yet another step...

 

Yes, my corpus was a roughly 1 GB text file, maybe too big.

One thing I forgot to mention: the attached script currently uses the HSK 1-5 words as known words. If you want to generate example sentences based on your own CTA known-words list, you have to change line 38

                        local known = cta.hskLevel( 1, 5 )

to

                        local known = cta.knownWords()

 

a) yeah, I thought about this, but I liked having them listed as empty so that I can see which words no example sentence was found for.

b) yes, but I am not sure if I know how to do it :)

c) this is probably personal preference. You can also mark them with just spaces, or not mark them at all, by deleting the brackets from the pattern on line 52:

                        print( sentence:clozeText( word, ' [%w] ' ) )

like this

                        print( sentence:clozeText( word, ' %w ' ) )  

imron
On 1/27/2021 at 10:14 AM, tiantian said:

I am no coder, so the question I have is: Is there a way to make this script faster?

Yes.

 

Currently the script processes the entire file once for each unknown word, so if you have 500 unknown words and 1GB of text, you end up processing 500 GB worth of data.

 

You'd get a significant speed up if you only processed the file once, and wrote out words as you went.

 

Doing it this way however will mean that things aren't sorted by word.  To fix this problem, what you could do is output the word followed by a space at the beginning of each line, followed by the cloze sentence, and then when you are done run the output through a program to sort each line lexicographically.  This will sort all the sentences with the same word together because you put the word at the start of the line.
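That sort-based grouping can be sketched in a few lines of plain Lua (the words and cloze sentences below are made-up examples, and in the real script the lines would be written to a file during the single pass rather than held in a table):

```lua
-- "word\tsentence" lines as they might be emitted during one pass over the corpus
local lines = {
  "香蕉\t他买了一根[香蕉]。",
  "苹果\t我喜欢吃[苹果]。",
  "苹果\t[苹果]很便宜。",
}

-- One lexicographic sort; because each line starts with its word,
-- all sentences for the same word end up adjacent.
table.sort(lines)

for _, line in ipairs(lines) do
  print(line)
end
```

On disk you would get the same effect by running the output file through an external sorting tool afterwards.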

 

The other thing you could do would be to write each sentence to a file, and have one file for each unknown word, so when it finished you'd have 500 files each with sentences for that word (though opening/closing files would likely be slower than the previous approach with sorting).

 

Another option would be to have a table keyed by word that saves a list of each sentence for that word, and then print out that list at the end.  The tradeoff with this approach though is that it would take up a lot of memory.
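A minimal sketch of that table-keyed approach in plain Lua (`addSentence` is an illustrative helper, not part of the CTA API):

```lua
local sentencesByWord = {}

-- Append a sentence under its word, creating the list on first use.
local function addSentence(word, sentence)
  local list = sentencesByWord[word]
  if not list then
    list = {}
    sentencesByWord[word] = list
  end
  list[#list + 1] = sentence
end

addSentence("苹果", "我喜欢吃[苹果]。")
addSentence("苹果", "[苹果]很便宜。")

-- At the end of the run, print everything grouped by word.
for word, list in pairs(sentencesByWord) do
  for i = 1, #list do
    print(word .. "\t" .. list[i])
  end
end
```

Every sentence stays in memory until the end of the run, which is the memory tradeoff described above.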

 

The other thing slowing the script down is that for each word you are calling sentence:words() twice (once in sentenceMostlyKnown and once in your own loop). This is a slow function because it parses and analyses the sentence each time.

 

If you look at sentenceMostlyKnown, you'll see that one of the return values is a table containing the list of unknown words in that sentence. So instead of calling sentence:words() again and looping through the results, you can just test whether `word` is in `unknown`, and if it is, print the sentence.
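The membership test itself needs no extra parsing; a generic sketch in plain Lua (`contains` is an illustrative helper, not a CTA function):

```lua
-- True if value appears in the array-style table list.
local function contains(list, value)
  for i = 1, #list do
    if list[i] == value then
      return true
    end
  end
  return false
end

-- e.g. with the `unknown` table returned by sentenceMostlyKnown:
local unknown = { "苹果", "香蕉" }
if contains(unknown, "苹果") then
  -- print the cloze sentence here
end
```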

 

Combining both of these things should result in significant speed improvements.  Let me know which approach you like for solving the first problem and I can help make the changes required.

 

 

tiantian
On 1/31/2021 at 3:27 AM, imron said:

You'd get a significant speed up if you only processed the file once, and wrote out words as you went.

 

On 1/31/2021 at 3:27 AM, imron said:

Doing it this way however will mean that things aren't sorted by word.

 

Ah I see, a small compromise for a big gain in speed! Thank you for helping me wrap my head around this. I went with your first approach.

 

On 1/31/2021 at 3:27 AM, imron said:

The other thing slowing the script down is for each word you are calling sentence:words() twice for each word

 

Oh, I can use the returned "unknown" from the function, now I see.

 

That made it much quicker! In addition I also used this advice:

"Don't use pairs() or ipairs() in critical code! Try to save the table-size somewhere and use for i=1,x do!" (https://springrts.com/wiki/Lua_Performance#TEST_9:_for-loops)

while looping through the unknownWords wordlist.
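In plain Lua that tip amounts to caching the table length once and indexing directly, instead of paying the per-step iterator call of pairs()/ipairs() (the per-word work below is stubbed out):

```lua
local unknownWords = { "苹果", "香蕉", "葡萄" }

-- Cache the length once; a numeric for avoids the iterator overhead of ipairs().
local n = #unknownWords
local processed = 0
for i = 1, n do
  local word = unknownWords[i]
  processed = processed + 1  -- stand-in for the real per-word work
end
print(processed)
```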

 

The script is much, much faster now and handles even my gigabyte corpus in a few minutes on my 10-year-old laptop.

 

The output file is word <tab> sentence. Although not sorted, you can just paste it into Excel and sort it quickly.

 

I'm sharing the script here. I find it really helpful for sentence mining if you have a big corpus of Chinese text. The big plus is that all the sentences are based on your known words and are therefore easily readable.

 

 

@imron is it possible to call the unknown words of the currently opened CTA document, something like cta.unknownWords()? That would be an easy way of generating example sentences based on the unknown words in what you want to read.

examplesentences2.lua

realmayo

Where are existing wordlists saved? I'd like to back them up, because a known-word list that I built up a couple of days ago has vanished - it's completely empty this morning.

(Oh - here, right? ".......\AppData\Local\ChineseTextAnalyser\wordlists\objects\known" )

 

Am I right that, unlike a few years ago, I can't add words into (effectively) the main CTA database any more? I found that was quite useful to overcome some of the segmentation limitations.

roddy

Too many pages ago to check, I suggested being able to jump to the next *unknown* instance of a word*, so that when, say, 巴 is shown as an unknown word, you can go directly to its occurrence in 巴塞罗那 to add that as a custom word, without having to click past all the instances of the known word 巴哈马.

 

You'd said at the time it was tricky, and if it's not possible, so be it. But it would easily be my No. 1 feature request. I like to track those down as there's always something interesting there, but when you're dealing with long files it can be somewhere between time-consuming and impractical to find them. 

 

Also, I would pay a lot more for this product than you charge.

 

*9 times out of 10, it's a single character that indicates a new word or transliteration is abroad, or that there's a segmentation error. Also for this reason, I'm wary of marking single characters as known - sure,  I know them as single characters, but if I 'know' the characters 卡 扎 and 菲, the new word 卡扎菲 is going to pass me by. But there are obviously trade-offs, as you want to be able to say you know 卡 in the sense of card, etc...

imron
2 hours ago, realmayo said:

Am I right that, unlike a few years ago, I can't add words into (effectively) the main CTA database any more? I found that was quite useful to overcome some of the segmentation limitations.

You are not right.  You can easily add words to the main CTA database: custom words can be added (one per line, in UTF-8) to ....ChineseTextAnalyser\data\words.u8, and you can add dictionary definitions in CC-CEDICT format to ChineseTextAnalyser\data\cedict_ts.u8.

 

If either of those files doesn't exist, just create it and save it as UTF-8.
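For reference, a hypothetical entry in each file might look like this (words.u8 is just one word per line; cedict_ts.u8 follows the standard CC-CEDICT layout of traditional, simplified, [pinyin], /definitions/ - the Gaddafi entry below is illustrative, with the traditional field shown identical to the simplified):

```
words.u8:
卡扎菲

cedict_ts.u8:
卡扎菲 卡扎菲 [Ka3 zha1 fei1] /Gaddafi (name)/
```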

 

2 hours ago, realmayo said:

Where are existing wordlists saved? I'd like to backup because a known word list that I built up a couple of days ago has vanished - it's completely empty this morning.

(Oh - here, right? ".......\AppData\Local\ChineseTextAnalyser\wordlists\objects\known" )

Back up the entire wordlists directory, not just wordlists\objects\known. There may be other wordlists, and there is a bunch of other stuff there as well. Finally, if you don't have the latest version of CTA (0.99.18): a previous version had a bug where extra wordlists might not get saved properly.

 

2 hours ago, roddy said:

Also, I would pay a lot more for this product than you charge.

If I had more time to work on it, I might implement all the features I have planned, and then I might charge more for it.

 

Until then, if you want to pay more, you can buy as many licenses as you like :mrgreen: 

 

2 hours ago, roddy said:

But it would easily be my No. 1 feature request.

I will also consider extra payment for specific feature requests :mrgreen:

realmayo
21 minutes ago, imron said:

You are not right. 

 

Delighted to hear it! On the hunt for my lost vocab list I came across a README file that told me "As of version 0.99.4, wordlist files are considerably more complex and manually adding/removing content from them is not supported".

 

But I now realise that "wordlist" is not the same thing as the dictionary database files - which in the past I happily edited all the time, and will now start doing so again.

imron

Right.  The wordlists are your lists of known/unknown words.  You can't directly edit them because CTA stores them internally as sets of added/removed words, in order to capture history (for an eventual graphing feature of learnt words over time).

 

The dictionary database files are separate from that, and can be modified freely.

Borkie
On 1/27/2021 at 6:42 PM, realmayo said:

Surely if text has already been segmented - with spaces between words - then CTA will respect those words as words, except where it doesn't have the words in its database (and therefore breaks them down into individual characters)?

And if not, can you find + replace all the spaces with new lines? I believe CTA should respect those. 

黄有光

Does Chinese Text Analyser currently have a function to tell you how many words total the program currently considers "known"?  I would like to have an estimate of my current vocabulary size.


It's normally shown in the status bar at the bottom of the screen; however, the status bar doesn't exist on all platforms (notably macOS).

 

Instead, you can choose Wordlists -> Manage from the menu; the dialog box shows a preview of each of your wordlists, with a summary of how many words each one contains.

