Chinese-forums.com
Learn Chinese in China

imron

Introducing Chinese Text Analyser

imron

I'll add "Reader Mode" to my todo list.

icebear

Look forward to all of these when available. Great app, highly recommended to others!

icebear

Request: allow me to configure highlight colors, especially in dark mode. Blue text on a black background is hard on the eyes!

imron

You can do this, but it involves manually editing a config file. If you're on Windows, this will be located in:

 

c:\users\\AppLocal\Data\ChineseTextAnalyser\colour-schemes\default.colours

 

(There is a similar file on other OSes, so let me know if you're on a different OS.)

 

You can edit that file with any text editor, and the values are hex-colours with the # removed.
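If your colour picker gives you values like #1E90FF, a small Python helper can strip the # before you paste them into the file. This is only a sketch: it assumes six-digit RRGGBB values, and it says nothing about the layout of the .colours file itself, which isn't described here. The function names are made up for illustration.

```python
import re

# Assumption: colours are six hex digits (RRGGBB). The post only says
# "hex-colours with the # removed", so other lengths are not handled.
HEX_COLOUR = re.compile(r"[0-9A-Fa-f]{6}")


def to_config_value(colour: str) -> str:
    """Turn '#1e90ff' (or '1E90FF') into '1E90FF', i.e. no leading '#'."""
    value = colour.strip().lstrip("#").upper()
    if not HEX_COLOUR.fullmatch(value):
        raise ValueError(f"not a 6-digit hex colour: {colour!r}")
    return value


def to_rgb(value: str) -> tuple:
    """Split a 6-digit hex colour into (red, green, blue) integers."""
    value = to_config_value(value)
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


if __name__ == "__main__":
    print(to_config_value("#1e90ff"))
    print(to_rgb("1E90FF"))
```

Checking the RGB values is a quick way to confirm that a highlight colour is light enough to read on a dark background before you edit the file.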

艾墨本

I'm trying to analyse the 普通话水平测试 (Putonghua Proficiency Test) to work out how to go about learning all the 字 (characters). I want to see whether, if I spend my time learning all the 字 in the 60篇文章 (60 essays), I will then know most of the 字 that appear on the test. However, CTA uses 词 (words) on the side and only says how many unique characters there are, without allowing me to do analysis based on the characters. Is there any way to work this out?

《普通话水平测试用普通话词语表》.doc 普通话水平测试文章60篇.doc

大块头

Is CTA meant to analyze things on the character level like that?

 

In any case, counting the characters in those essays only took a few lines of Python, if that's helpful to you. See the attached csv file.

char_count.csv
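For anyone who wants to try the same thing, a character count along these lines might look like the sketch below. It is not the script that produced the attached file. It assumes the essays have already been saved as plain text, and it only treats the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as Chinese; the function names are illustrative.

```python
from collections import Counter


def is_chinese(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block counts.
    # Rare characters in the extension blocks are ignored.
    return "\u4e00" <= ch <= "\u9fff"


def count_chars(text: str) -> Counter:
    """Count how often each Chinese character occurs in the text."""
    return Counter(ch for ch in text if is_chinese(ch))


if __name__ == "__main__":
    sample = "好好学习,天天向上"
    # most_common() lists characters from most to least frequent.
    for char, n in count_chars(sample).most_common():
        print(f"{char},{n}")
```

Punctuation, Latin letters and digits are skipped, so the output is one line per character with its frequency, ready to be saved as a csv.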

艾墨本
30 minutes ago, 大块头 said:

Is CTA meant to analyze things on the character level like that?

 

In any case, counting the characters in those essays only took a few lines of Python, if that's helpful to you. See the attached csv file.

Thank you. Coding is definitely a language I wish I was more interested in. So useful.

 

But yes, that's the kind of thing I'm looking for, though not just the raw frequency of the characters. I want to determine what percentage of the characters in the word list would be covered if I learned all of the characters that show up in the 60 essays, and vice versa (what percentage of the characters in the essays would be covered if I learned the list of words).

 

Would it be doable in Python to check this?

大块头

The essays contain 2293 unique characters. The word list contains 1668 unique characters. The intersection of these two sets contains 1307 characters.

 

I won't share my code just in case there is some way to make CTA do this. My intention isn't to cobble together some 山寨 version of one of its functions...
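For readers who want to run this kind of comparison on their own texts, the idea can be sketched with Python sets. This is an independent illustration, not the code mentioned above; it assumes both documents are available as plain-text strings and counts only characters in the basic CJK block, and all names in it are hypothetical.

```python
def unique_chars(text: str) -> set:
    """Set of distinct Chinese characters (basic CJK block only, by assumption)."""
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}


def compare(essays: str, word_list: str) -> dict:
    """Compare the character inventories of two texts."""
    a = unique_chars(essays)
    b = unique_chars(word_list)
    shared = a & b  # characters that appear in both texts
    return {
        "essay_chars": len(a),
        "list_chars": len(b),
        "shared": len(shared),
        "only_in_list": len(b - a),  # still to learn after the essays
        "list_covered_pct": 100.0 * len(shared) / len(b) if b else 0.0,
        "essay_covered_pct": 100.0 * len(shared) / len(a) if a else 0.0,
    }


if __name__ == "__main__":
    print(compare("我爱学习中文", "学习汉语"))
```

Plugging in the figures quoted above, 1307 shared characters out of 1668 means the essays cover roughly 78% of the word list's unique characters, leaving 361 to learn separately, while the word list covers roughly 57% of the 2293 unique characters in the essays.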

艾墨本

That's great info. Then I'm going to work on learning the 60 essays since that is more fun than the list and then learn the remaining 300+ characters after that. Might take a couple years, though.

 

Add this to my list of function requests for CTA @imron

大块头
17 minutes ago, 艾墨本 said:

I'm going to work on learning the 60 essays

 

Sounds like a great use case for CTA!

LinZhenPu

@艾墨本

Are you going to one day take the Putonghua test that mainland Chinese people take? 😮

艾墨本
3 hours ago, LinZhenPu said:

Are you going to one day take the Putonghua test that mainland Chinese people take? 😮

That's my goal. I started working through it last year and got sidetracked by COVID. Four of the essays down, 56 to go. I'm focusing on quality over quantity (though quantity will be needed eventually), making sure I can properly recite each line in a "storytelling" fashion. Even after learning just four of them with my tutor (shout-out to @GoEastMandarin) I saw an enormous amount of growth.

 

CTA helps me determine which words to focus on.

 

imron
22 hours ago, 大块头 said:

I won't share my code just in case there is some way to make CTA do this. My intention isn't to cobble together some 山寨 version of one of its functions...

Thanks for the consideration, but I generally follow a philosophy that more is better than less, so regardless of whether or not CTA can do this, please feel free to share source code or tools that other people might find useful (but maybe start a new thread, to keep this one just about CTA).

 

That being said, CTA intentionally focuses on 词 rather than 字 and doesn't have this feature built in.  I've considered adding it, but am still in two minds about it.

 

However, what CTA does have is Lua scripting support, and in that sense you can make CTA do whatever you want. For example, here is a script that counts the number of unknown characters in a document. It would be trivial to modify that script to count all characters: just change line 47 from this

 

        if charType == "Chinese" and knownChars[char] == nil then

 

to this

 

        if charType == "Chinese" then

 

And with a bit of effort, it could also be made to calculate the % coverage of a document with a given word list - in fact there is already a script that ships with CTA (char-coverage.lua) that does this for HSK6 coverage of a given document.
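To make the idea of "% coverage" concrete outside CTA, here is a minimal Python sketch. It is not the char-coverage.lua script and does not use CTA's Lua API. It assumes the document is a plain-text string and the known characters are given as a set, and it measures coverage over running text, so a frequent character counts every time it appears. The function name is made up for illustration.

```python
from collections import Counter


def running_coverage(document: str, known: set) -> float:
    """Percentage of the Chinese characters in `document`, counted with
    repetition, that are in `known`."""
    # Assumption: only the basic CJK block (U+4E00 to U+9FFF) is counted.
    counts = Counter(ch for ch in document if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no Chinese characters at all
    covered = sum(n for ch, n in counts.items() if ch in known)
    return 100.0 * covered / total


if __name__ == "__main__":
    known_chars = set("好学")
    print(running_coverage("好好学习,天天向上", known_chars))
```

Counting with repetition usually gives a higher figure than comparing unique characters, because common characters dominate real text; which measure is more useful depends on whether you care about reading fluency or about the size of the list left to learn.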

 

@大块头, if you don't want to tread on CTA's toes, feel free to make Lua script versions of any scripts and post them in that other thread :mrgreen:

大块头
1 hour ago, imron said:

@大块头, if you don't want to tread on CTA's toes, feel free to make Lua script versions of any scripts and post them in that other thread :mrgreen:

 

生活苦短,我用Python。 (Life is short, I use Python.) :wink:

philwhite
On 8/10/2020 at 2:57 AM, 大块头 said:

生活苦短,我用Python。

 

生活苦短,我用bash (life is short, I use bash):

echo 'Unknown char count:'

comm -13 sortedknowncharlist <(cat file | sed 's/\(.\)/\1\n/g' | sort | uniq) | wc -l

大块头

For every 100 of us snot-nosed brats scribbling on whiteboards and typing at our fancy-schmancy IDEs, there is some UNIX wizard who is sipping coffee and browsing Usenet because they've already solved the problem with a bash one-liner. :mrgreen:

icebear
On 8/7/2020 at 5:56 PM, imron said:

You can do this, but it involves manually editing a config file. 

Thanks, worked like a charm!

philwhite
On 8/10/2020 at 2:50 PM, philwhite said:

comm -13 sortedknowncharlist <(cat file | sed 's/\(.\)/\1\n/g' | sort | uniq) | wc -l

TMTOWTDI:

comm -13 sortedknowncharlist <(sed 's/./&\n/g' file | sort -u) | wc -l

 

imron
On 8/12/2020 at 4:04 PM, icebear said:

Thanks, worked like a charm!

What colours did you end up using?

 

15 hours ago, philwhite said:

comm -13 sortedknowncharlist <(sed 's/./&\n/g' file | sort -u) | wc -l

Stray cats are a continual problem with unix one-liners 😉

mungouk

I bought CTA a while back, and to be honest have only used it once or twice, to analyse HSK levels of ebooks or similar.

 

I think what I'm missing is some good descriptions of use-cases and tutorials to show what it's capable of, and how I could be using it.

 

Are there any examples out there already on, say, YouTube?

If not, do any of you power-users feel like explaining how you use it to do things you couldn't do with other tools?

 

I guess I'm not the only one who could benefit from your collective wisdom.

 

Cheers!

 

 

