Chinese-Forums

Lua Script and Chinese Text Analyser


imron


On 3/31/2022 at 3:32 PM, yaokong said:

With your LUA script I just processed 161 books in 11 seconds

CTA was built with this sort of use case in mind - not the Lua scripting per se, but analyzing a batch of books and figuring out which one is the most suitable to read.

 

It's the sort of thing I'd like to incorporate into the program itself, but I added Lua scripting as a stopgap because I don't have as much time as I'd like to work on CTA, and scripting is a relatively easy way to extend the program's functionality without waiting for a new release.


  • 2 weeks later...

@imron, I found a bug in this script: it seems to skip filenames and folder names with Chinese characters. Is that expected? I assume not, since we are talking about scripting CTA, which was built precisely for processing Chinese texts.

 

The folder I am testing on is "_YiXi - TED Talks of China" from Chinese Transcripts. The script processes 6 files out of 647 (results attached), and none of those 6 have Chinese characters in the filename. The ones with Chinese characters are skipped entirely. 

 

I just tried again, and it failed even with some texts that have English filenames. This time it was the folder "华灯初上\Season 1\Plain Text - ZH Simplified", also from Chinese Transcripts. The result is empty: it contains only the header line. If I move those files to another folder that has no parent folder with Chinese characters in its name, the files are processed just fine.

 

If it is too much work to fix, don't worry about it; I will find a temporary workaround, such as renaming all the files in bulk.

_YiXi - TED Talks of China_knownWordsLUA.txt 华灯初上--Season 1--Plain Text - ZH Simplified_knownWordsLUA.txt


These days, I am generating transcripts for the videos on this YouTube channel.

https://www.youtube.com/c/Lindsay说

 

With the transcripts, I try to identify idiomatic expressions such as chengyu and make Anki cards for them, using the Lua script attached below.

 

 

At the moment, when I run the Lua script to generate Anki cards, only cards for entries that do not already exist in my Anki collection get imported. That is, if a card for 一知半解 already exists and the newly generated cards also contain one for 一知半解, the new card is not imported. As a result, I am restricted to one example sentence per entry. That is not bad for now, but I think having multiple example sentences could be useful, and I have noticed that some chengyu are used in multiple videos. Once I have generated transcripts for almost all the videos on that channel, I am thinking about compiling the texts and extracting example sentences across them.

 

Is it possible to write a Lua script that checks all the texts and extracts every sentence that uses a certain expression?

 

Attached is the Lua script that I currently use. It generates only one example sentence per entry (the word that I marked as unknown).

anki-export.lua
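
The idea in the question above (scan every text, collect every sentence that contains a given expression) can be sketched outside CTA. The following is a minimal standalone Python sketch, not a CTA Lua script and not the attached anki-export.lua. The function names, the set of sentence-ending punctuation, the UTF-8 `.txt` layout, and the `<br>` separator are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed sentence boundaries: Chinese/ASCII end punctuation or a line break,
# optionally followed by closing quotation marks.
SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]*[”’」』]*")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


def load_texts(folder):
    """Read every .txt file under folder (recursively), assuming UTF-8."""
    return {str(p): p.read_text(encoding="utf-8")
            for p in Path(folder).rglob("*.txt")}


def find_examples(texts, expressions):
    """Map each expression to every (source, sentence) pair containing it."""
    examples = defaultdict(list)
    for source, text in texts.items():
        for sentence in split_sentences(text):
            for expr in expressions:
                if expr in sentence:
                    examples[expr].append((source, sentence))
    return dict(examples)


def to_tsv_rows(examples):
    """One tab-separated row per expression, all sentences in the second field."""
    return [f"{expr}\t{'<br>'.join(s for _, s in hits)}"
            for expr, hits in examples.items()]
```

Something like `find_examples(load_texts("transcripts"), ["一知半解"])` would then return every matching sentence together with the file it came from (the folder name here is hypothetical). Putting all sentences for one expression into a single row is one possible way around the one-card-per-entry limit described above, since each expression still appears only once in the import file.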


On 4/9/2022 at 12:22 PM, yaokong said:

is that expected?

It's unexpected. The script just gets all files in all directories, so I'm not sure why files and directories with Chinese characters are being excluded. I don't have time to look into it at the moment, so maybe go the workaround route for now.

 

Edit: using filenames with Chinese characters works for me on macOS.

 


On 4/10/2022 at 12:18 PM, imron said:

using filenames with chinese characters works for me on macos.

You gave me an idea: I just tested on my Linux machine, and there it works just fine. I was using my wife's Windows laptop yesterday, and the bug only occurred there. Maybe I need to change the Windows region code page for non-Unicode programs (I forget the exact term) to Chinese.
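
The code page guess can be illustrated with a small Python sketch. It rests on an assumption the thread does not confirm: that the script ends up opening files through Windows' legacy non-Unicode file APIs, where a path has to be representable in the system's active code page. The helper name `encodable_in` is made up for this example, and cp1252 and cp936 are used as typical Western and Simplified Chinese code pages.

```python
def encodable_in(name, codepage):
    """Return True if every character of name exists in the given code page."""
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# Folder names taken from the report above.
for folder in ("华灯初上", "Season 1"):
    for codepage in ("cp1252", "cp936", "utf-8"):
        print(folder, codepage, encodable_in(folder, codepage))
```

Under that assumption, 华灯初上 cannot be expressed in cp1252 at all but can be in cp936, which would fit files under that folder being skipped on a Western-locale Windows machine while Linux and macOS, which use UTF-8 paths, work fine.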

