Chinese-Forums

Introducing Chinese Text Analyser


imron


2 hours ago, imron said:

That are not as fast.

 

I'd say it's better to have perfect segmentation at 5k words per second (Stanford segmenter, GPL) than a blazing fast segmenter that makes single-digit % errors. Why are you so much into extremely fast processing?  I chucked the whole 中华上下五千年 into the Stanford segmenter and it took just over a minute. I suppose if someone processes 100 full-length books a day, this might be slightly annoying, but approximately 100% of your users would be happy to wait a few seconds at startup in exchange for a perfect result.

 

Quote

C:\Users\xxx\Desktop\stanford-segmenter-2018-02-27>segment.bat ctb ..\shangxia.txt UTF-8 0 >words.txt
CTB: Chinese Treebank segmentation
File: "..\shangxia.txt"
Encoding: "UTF-8"
kBest: "0"
-------------------------------
Invoked on Tue Sep 11 17:48:05 AEST 2018 with arguments: ...

...

CRFClassifier tagged 543985 words in 6101 documents at 4387.83 words per second.

 

And it did segment my example sentence perfectly (spaces):

没有 最大 限度 地 利用 已 有的 资源

 

Another cut and paste example, some HSK5 test IIRC. If you want to compare this result to CTA, remove segmenting spaces before pasting it.

Quote

把 电脑 回收站 清空 后 , 文件 是 不 是 就被 彻底 删除了 ? 答案 是否 定的 。 其实 那些 被 你 删除 的 文件 还 好好 地 放在 原来 的 位置 , 一 步 都 没 挪动 。 这 也 是 为什么 一些 文件 被 彻底 删除了 , 却 还 能 被 数据 恢复 软件 找 回来 的 原因 。
假如 你 对 某个 硬盘 的 全部 文件 都 执行 了 删除 命令 , 那么 这些 文件 立刻 就 都 消失 了 。 但 事实 上电脑 并没有 删除 它们 , 而是 做了 以下 的 事情 : 第一 , 将 这个 盘 的 文件 设为 不 显示 ; 第二 , 给 这个 盘 做 一 个 特殊 的 标记 —— 这个 盘 里 的 文件 全 都 没 用了 , 如果 要 储存 新 的 文件 可以 存放 到 这个 盘 里 来 。
如果 新 的 数据 存放 进去 后 , 完全 占 满 了 这个 盘 , 那 你 以前 的 文件 就 真的 彻底 没了 。 但 如果 你 删除 文件 后 , 一直 没有 新 文件 存入 , 那么 , 这些 被 删除 的 文件 就 会 永远 留在 原处 , 只不过 不 显示 而已 。 数据 恢复 软件 的 作用 就 是 让 它们 重新 显示 出来 。 电脑 之所以 这么 做 , 是 为了 提高 工作 效率 , 因为 让 电脑 真正 抹杀 一 份 数据 所 消耗 的 时间 是 很 长 的 , 如果 电脑 真的 如此 处理 删除 命令 , 估计 每个 用 电脑 的 人 都 会 疯掉 的 。
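
Stripping the segmentation spaces before pasting into another tool takes only a line of scripting. A minimal sketch, assuming the segmenter only inserts plain ASCII spaces between tokens and the source text had no meaningful spaces of its own (the function name is made up for illustration):

```python
def strip_segmentation(text: str) -> str:
    """Remove the ASCII spaces a segmenter inserted between tokens.

    Newlines are left alone so paragraph breaks survive.
    Assumption: the unsegmented text contained no spaces that matter.
    """
    return text.replace(" ", "")


segmented = "没有 最大 限度 地 利用 已 有的 资源"
print(strip_segmentation(segmented))
```

For text that mixes Chinese with English, this would also glue the English words together, so it is only suitable for all-Chinese passages like the ones above.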

 

2 hours ago, imron said:

If you don't mind me asking, which software?

 

TBH (see above), when I need segmentation (i.e. rarely), I use the Stanford Java tool. Once I have a file with space-separated "words", I can do whatever I want with a bit of scripting. I usually just test the % of words I don't know in the book. And I must say CTA without babysitting it by adding fake words can be 5%+ off the real value, which is kind of important - 80+% understanding vs 90% is a big deal. (This babysitting is another pain in the *** - your policeman-like approach to looking up words doesn't make it easier to correct the tool's mistakes, but I digress.)
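
The "bit of scripting" for the unknown-word percentage might look roughly like this. It is a sketch under assumptions, not the poster's actual script: the input is space-separated segmenter output, the known words are held in a set, and tokens without any CJK character (punctuation, digits) are ignored. All function names are hypothetical.

```python
from collections import Counter


def is_cjk(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def unknown_ratio(segmented_text: str, known_words: set[str]) -> tuple[float, float]:
    """Return (share of running words unknown, share of unique words unknown)."""
    # Keep only tokens containing at least one Chinese character,
    # which drops punctuation such as the full-width comma.
    tokens = [t for t in segmented_text.split() if any(is_cjk(c) for c in t)]
    if not tokens:
        return 0.0, 0.0
    counts = Counter(tokens)
    unknown_tokens = sum(n for word, n in counts.items() if word not in known_words)
    unknown_types = sum(1 for word in counts if word not in known_words)
    return unknown_tokens / len(tokens), unknown_types / len(counts)


sample = "把 电脑 回收站 清空 后 , 文件 是 不 是"
known = {"把", "电脑", "后", "文件", "是", "不"}
by_token, by_type = unknown_ratio(sample, known)
print(f"unknown by running words: {by_token:.1%}, by unique words: {by_type:.1%}")
```

The two figures can differ a lot on a real book, because a handful of unknown words may be repeated very often.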

 

I checked my Windows VMs - there's Chinese Word Extractor and Chinese Toolbox, yet they are long dead and make exactly the same mistakes on my example sentence, so they are not a good example. I did play with them for a while last year tho, they must have been better at least on some texts. So, there is no Windows-specific segmenter I can recommend, I take that back. Java runs more or less everywhere.


3 hours ago, uvwxyz said:

Why are you so much into extremely fast processing? 

It makes a big difference for responsiveness, especially with the real-time highlighting of known/unknown words, and even more so if you have a large screen resolution and a full-screen window full of text.  Assuming a refresh rate of 10 frames per second for a minimum level of responsiveness, 5,000 words/sec gets you 500 words per frame, or about double the amount of text you posted above.  That's far less than a screen of text, and that means you get lag in highlighting any time you scroll the screen or mark words as known/unknown.  That's something of a pet peeve of mine.  I hate it when my code editor flashes as it updates the syntax highlight of a much smaller file (happens all the time with Xcode) and there's not much excuse for it.  CTA will open any file instantly and will also update highlighting more or less instantly regardless of size of file or screen size, and that's something I care about.
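
The frame-budget arithmetic above can be written out as follows. The 5,000 words/sec and 10 frames/sec figures come from the post; the 2,000 words on screen is a made-up number used only to illustrate the lag.

```python
def words_per_frame(words_per_second: float, frames_per_second: float) -> float:
    """How many words can be segmented within a single frame."""
    return words_per_second / frames_per_second


def frames_to_segment(visible_words: int, words_per_second: float,
                      frames_per_second: float) -> float:
    """How many frames it takes to re-segment everything on screen."""
    return visible_words / words_per_frame(words_per_second, frames_per_second)


print(words_per_frame(5000, 10))          # 500.0
# Hypothetical screen holding 2,000 words: highlighting lags by 4 frames.
print(frames_to_segment(2000, 5000, 10))  # 4.0
```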

 

3 hours ago, uvwxyz said:

I must say CTA without babysitting it by adding fake words can be 5%+ off the real value, which is kind of important

I used to think this way too, but part of the reason the segmenter hasn't been getting much love is because I found out that for many use cases it's not actually that important.  It still gives accurate enough ballpark figures if you're trying to estimate the relative difficulty between two texts or if you're trying to find the words with the highest frequency.

 

That being said, I'd love to spend more time on the segmenter, and get it to the same level of correctness as the Stanford one without sacrificing speed, but unfortunately I'm unlikely to have the time to do that anytime within the next few months.

 

3 hours ago, uvwxyz said:

Java runs more or less everywhere. 

Less more often than more on Macs.


12 hours ago, imron said:

That's something of a pet peeve of mine.  I hate it when my code editor flashes as it updates the syntax highlight of a much smaller file (happens all the time with Xcode) and there's not much excuse for it

 

OK, that explains it.


Windows: c:\users\<username>\AppData\Local\ChineseTextAnalyser\wordlists

macOS: ~/Library/Application Support/ChineseTextAnalyser/wordlists/

Linux: ~/.local/share/ChineseTextAnalyser/wordlists

 

You should be careful to only open it on one machine at a time, otherwise you might lose known words.  This is because it loads the wordlist at the beginning of a session and saves it out to disk when the application exits.

 

So if you open it on computer 1, and then make some changes, and then open it up on computer 2 and make some other changes and then close the app on computer 2 and then close the app on computer 1, the version left on disk at the end will be the version on computer 1.
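
The sequence above is a classic "last writer wins" situation. The toy model below only mirrors the behaviour described in the post (load at startup, save on exit). It assumes the save overwrites the whole wordlist and says nothing about how CTA is actually implemented. The class and the shared `disk` dictionary are purely illustrative.

```python
class Session:
    """Models an app that loads the wordlist at start and saves it on exit."""

    def __init__(self, disk: dict):
        self.disk = disk
        self.words = set(disk["wordlist"])  # snapshot taken at startup

    def mark_known(self, word: str) -> None:
        self.words.add(word)  # changes live in memory only

    def close(self) -> None:
        # Assumption: the whole list is written out, replacing what was there.
        self.disk["wordlist"] = set(self.words)


def simulate() -> set:
    disk = {"wordlist": {"你好"}}   # stands in for the synced wordlists folder
    computer1 = Session(disk)
    computer1.mark_known("电脑")
    computer2 = Session(disk)       # opened while computer 1 is still running
    computer2.mark_known("文件")
    computer2.close()               # disk now holds 你好 + 文件
    computer1.close()               # overwrites it with 你好 + 电脑
    return disk["wordlist"]


print(simulate())  # the word marked on computer 2 is gone
```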


Hi Imron, thanks for the reply. I found the files and copied them over to the other PC. I will make sure not to have the files open on both machines at the same time.

 

It would be great though if you could make a future version that is portable (either all files, i.e. binaries, wordlists etc., in the same folder, OR an option to save the wordlists at a user-defined location).


Hi Imron, not sure whether you know about this bug: if you increase the font size (I like really large characters), it does not activate the scrollbar, so you end up with text that flows below the visible area of the text window and you have no way to scroll down, using either the mouse wheel or the keyboard. Enlarge beyond a certain character size and you simply cannot scroll to the end of the text, no matter how long it is.

 

Also: sometimes the automatic segmentation is wrong, e.g. for "不了解" it segments as "不了+解". Could you add a function to manually override this?
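
To illustrate how a mistake like this can arise and what a manual override could look like, here is a sketch of greedy forward maximum matching with a user override table. This is not necessarily how CTA segments text; the tiny dictionary, the override format and the function name are all assumptions made for the example.

```python
def segment(text, dictionary, overrides=None, max_len=4):
    """Greedy longest-match segmentation with optional user overrides.

    overrides maps a string to the exact pieces the user wants for it.
    """
    overrides = overrides or {}
    pieces = []
    i = 0
    while i < len(text):
        # User overrides win: check the longest override keys first.
        hit = next((key for key in sorted(overrides, key=len, reverse=True)
                    if text.startswith(key, i)), None)
        if hit is not None:
            pieces.extend(overrides[hit])
            i += len(hit)
            continue
        # Otherwise take the longest dictionary word starting at i,
        # falling back to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in dictionary:
                pieces.append(candidate)
                i += n
                break
    return pieces


dictionary = {"我", "不了", "了解"}
print(segment("我不了解", dictionary))
# ['我', '不了', '解']  (greedy matching grabs 不了 first)
print(segment("我不了解", dictionary, overrides={"不了解": ["不", "了解"]}))
# ['我', '不', '了解']
```

A blanket override like this would also split cases where 不了 really is the intended word, so a real implementation would probably need per-occurrence overrides.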


  • 4 months later...

In Finder press Command-Shift-G and in the text box that appears type in:

 

~/Library/Application Support/ChineseTextAnalyser/

 

Copy (or zip up) this entire folder and copy it somewhere on your new machine - easiest is probably to a folder on your desktop called ChineseTextAnalyser.

 

Then on your Windows machine, install Chinese Text Analyser.  When it asks for your licence, there will be a copy in <Desktop>\ChineseTextAnalyser\cta.licence (assuming you copied the files to <Desktop>\ChineseTextAnalyser).

 

Now quit CTA.  It's important to run it once, as it will configure all the user directories for you.

 

Make sure to really quit, because CTA saves all its config files upon exit, so if you overwrite them while the program is running, they'll be changed back to the old values when CTA quits.

 

Now open up C:\Users\<username>\AppData\Local\ChineseTextAnalyser

 

Delete the 'wordlists' folder, and then copy the entire 'wordlists' folder from your copy on the Desktop (<Desktop>\ChineseTextAnalyser\wordlists)

 

If you've modified the 'colour-schemes' at all, then do the same thing with the 'colour-schemes' folder.

 

Finally, to get any custom words and dictionary definitions, go in to <Desktop>\ChineseTextAnalyser\data and copy words.u8 and cedict_ts.u8 to 'C:\Users\<username>\AppData\Local\ChineseTextAnalyser\data' which won't have those files in it.

 

Then restart CTA and you should be good to go.
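
For anyone who prefers to script the copy steps above, a rough sketch follows. The folder and file names are the ones given in the post. The function name, its parameters and the example paths in the trailing comment are assumptions. It must only be run after CTA has been started once and fully quit, for the reasons given above.

```python
import shutil
from pathlib import Path


def restore_cta_data(backup: Path, target: Path,
                     include_colour_schemes: bool = False) -> None:
    """Copy wordlists and custom dictionaries from a backup folder into
    CTA's user directory, replacing the freshly created defaults."""
    folders = ["wordlists"]
    if include_colour_schemes:  # only needed if you modified them
        folders.append("colour-schemes")
    for folder in folders:
        src = backup / folder
        if not src.is_dir():
            continue
        dst = target / folder
        if dst.exists():
            shutil.rmtree(dst)  # delete the existing folder first
        shutil.copytree(src, dst)
    # Custom words and dictionary definitions.
    data_dir = target / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    for name in ("words.u8", "cedict_ts.u8"):
        src = backup / "data" / name
        if src.is_file():
            shutil.copy2(src, data_dir / name)


# Example call (paths are assumptions; adjust to where your copy lives):
# restore_cta_data(Path.home() / "Desktop" / "ChineseTextAnalyser",
#                  Path.home() / "AppData" / "Local" / "ChineseTextAnalyser")
```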


Thank you Imron! One more question--

 

Quote

When it asks for your licence, there will be a copy in <Desktop>\ChineseTextAnalyser\cta.licence (assuming you copied the files to <Desktop>\ChineseTextAnalyser).

 

I'm not sure I understand this sentence. Are you saying that when CTA asks for my license, I select cta.licence from the \ChineseTextAnalyser folder on my desktop?


Yes, you can do that, or you can use the licence key originally sent to you via email - both will work because when you register CTA, it keeps an internal copy of your licence and that is where it puts it (so it will be an exact copy of the licence file you originally used to register).  Doing it this way just saves you hunting through your email archives for the licence key.

 

Technically speaking, you could just copy 'cta.licence' from the \ChineseTextAnalyser folder on your desktop to c:\users\<username>\AppData\Local\ChineseTextAnalyser and skip the entire registration process.  But going through the registration dialog lessens the likelihood of human error - if CTA can't find the internal copy of the licence in the right place, then it will fall back into 'free trial' mode.

 


TLDR: can you make it possible to select multiple word lists at once?

 

A while back, I mentioned that it would be nice to somehow keep track of how many times I've seen a word in different texts, because I rarely really know the word after learning it for just one text. I think at the time you said you'd try to figure out some way to do this. But I've since been worried that any approach used might be inefficient for some words and insufficient for others. Some words I really do know after just seeing them once (at least I know them well enough to read them - maybe I wouldn't be comfortable using them but that will come with time and more reading, not more SRS reps), whereas others I really do need to see several times in several books to know them. So I think any approach that, for example, marks any exported word that has been exported 3 times as known, would be inefficient for words in the first case and insufficient for words in the second case.

 

Recently, though, I've thought I should have a "rotating" known words list. If I learn the words for book x, I add them to my "Known" list to make a "Known + x" list. Then I use that list to see what words I need to study for book y. After having studied book y, I have a "known + x + y" list. Then I use that word list to study book z, etc, etc. I think what I should do (although I haven't actually tried this yet) is to periodically remove word lists. So for example, let's say I have the word list "known + x + y + z + a + b + c" for six books I've read. For my 7th book, I'd like to remove the words that I studied under book x, so I only have "known + y + z + a + b + c". Any words I come across that I feel I really know, I can then add to my "known" word list, which I never rotate out.

 

This can be done in the current version, but it would be made a whole lot easier if multiple word lists could be selected at once, so I didn't have to concatenate all my wordlists, then periodically un-concatenate the older ones, etc.
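
Outside of CTA, the rotation described above boils down to a set union over the permanent list and the most recent book lists. A minimal sketch, assuming each wordlist is loaded into a set (the function name and the `keep_last` parameter are made up for illustration):

```python
def working_list(known: set[str], book_lists: list[set[str]],
                 keep_last: int) -> set[str]:
    """Union of the permanent list and the most recent `keep_last` book lists.

    book_lists is ordered oldest first; older lists are rotated out.
    """
    # Guard against keep_last == 0, since book_lists[-0:] is the whole list.
    recent = book_lists[-keep_last:] if keep_last > 0 else []
    return known.union(*recent)


known = {"你", "好"}
books = [{"x1"}, {"y1"}, {"z1"}, {"a1"}, {"b1"}, {"c1"}]  # x, y, z, a, b, c
# For the 7th book keep only the last five lists: known + y + z + a + b + c
print(sorted(working_list(known, books, keep_last=5)))
```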


I don't know if selecting multiple lists is the answer (which list do words get added to if you mark them as known?) but maybe allowing for easy manipulation of word lists would suffice, for example, allowing you to create wordlists by specifying an expression such as "known + a + b - z".
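
As a rough idea of what evaluating such an expression involves, the sketch below reads it left to right, treating `+` as set union and `-` as set difference. It is only an illustration of the idea, not a description of any existing or planned CTA feature. The function name is hypothetical, and list names containing spaces, `+` or `-` are not supported.

```python
import re


def evaluate(expression: str, lists: dict[str, set[str]]) -> set[str]:
    """Evaluate e.g. "known + a + b - z" strictly left to right."""
    # Split into operators and list names, ignoring whitespace.
    tokens = re.findall(r"[+-]|[^\s+-]+", expression)
    result: set[str] = set()
    op = "+"  # the first list is added to the empty set
    for token in tokens:
        if token in ("+", "-"):
            op = token
            continue
        words = lists[token]  # raises KeyError for an unknown list name
        if op == "+":
            result = result | words
        else:
            result = result - words
    return result


lists = {
    "known": {"一", "二"},
    "a": {"三"},
    "b": {"四", "五"},
    "z": {"二", "五"},
}
print(sorted(evaluate("known + a + b - z", lists)))
```

Because evaluation is left to right, "known - z + b" and "known + b - z" can give different results when b and z overlap.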


34 minutes ago, imron said:

which list do words get added to if you mark them as known?

 

Could you select multiple word lists but privilege one over the rest, and it's that one that known words are added to (either upon export, or clicking in the document, etc.)?

 

36 minutes ago, imron said:

maybe allowing for easy manipulation of word lists would suffice, for example, allowing you to create wordlists by specifying an expression such as "known + a + b - z"

 

Yes, I think that would work!


3 hours ago, Yadang said:

Could you.....

Everything is possible to do, but it's also a matter of trying not to make things too complicated, and to me selecting multiple lists and allowing one to be privileged is starting to feel a little complicated - maybe not to you and me because we're discussing it here and know what we are talking about, but to every other user who comes across it in the UI and doesn't have the benefit of this context.

 

3 hours ago, Yadang said:

Yes, I think that would work!

I think this is the much better thing to do - just make it easier to combine/subtract lists together in various ways so you don't have to do it manually.

