
New tool for vocabulary extraction


c_redman

Thank you sir. Your online tool is awesome - I especially like the log-frequency statistic it provides.

We've had a few tools somewhat like this before, and one useful feature is the ability to exclude/ignore known words or characters. It looks like your tool has this as well, and thus it almost supersedes the existing tools. Very impressive.


Thank you sir. Your online tool is awesome - I especially like the log-frequency statistic it provides

Unfortunately, I wasn't able to include the word frequency statistics from the Lancaster Corpus in this program. I use it personally, but it wasn't clear from their distribution license that I could include it, so I erred on the side of caution. If there is an obviously free word frequency list, I can include it. Meanwhile, anyone can easily add in any statistics they have access to.

Character frequency lists, however, are plentiful. I just wanted to get this basic program out first, and add that data the first chance I get.


Unfortunately, I wasn't able to include the word frequency statistics from the Lancaster Corpus in this program. I use it personally, but it wasn't clear from their distribution license that I could include it, so I erred on the side of caution.

Would it be possible to upload the column of words extracted by the offline program and have your site generate the word frequency statistics (in the same order, so that they could be added back as a new column)?
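For anyone who has their own frequency list, the "same order, new column" step can be sketched locally in Python. This is only an illustration: `frequency_column`, the sample words and the frequency numbers are all made up, not part of the tool or of any real corpus.

```python
def frequency_column(words, freq_per_million, missing=""):
    """Return one frequency per input word, in the same order as `words`.

    Words absent from the frequency list get the `missing` placeholder,
    so the result always lines up row by row with the original column.
    """
    return [freq_per_million.get(word, missing) for word in words]


# Hypothetical data: a word column from the extractor and a made-up frequency list.
word_column = ["葡萄", "天", "跑出去"]
sample_freq = {"天": 1000.0, "葡萄": 10.0}

freq_column = frequency_column(word_column, sample_freq)

# Print as tab-separated rows that can be pasted back next to the word column.
for word, freq in zip(word_column, freq_column):
    print(f"{word}\t{freq}")
```

Because the lookup never reorders or drops rows, the output can be pasted straight back into a spreadsheet next to the original words.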


I have looked up the Lancaster Corpus online and read the current license page. (http://www.ota.ox.ac.uk/scripts/download.php?otaid=2474)

I believe that if you let people obtain their own copy (by submitting an email address and agreeing to the terms for private use), you could have the program use the corpus data whenever it is present.

This model is used by many other software programs.


I got this working on Ubuntu Linux, although it's not a tidy package at this point. Here are the commands I used:

lsb_release -a
# No LSB modules are available.
# Distributor ID: Ubuntu
# Description: Ubuntu 11.04
# Release: 11.04
# Codename: natty
wget http://apt.wxwidgets.org/key.asc -O - | sudo apt-key add -
# if it does not return to the prompt, it is waiting for the sudo password
sudo nano /etc/apt/sources.list
# add the three lines below to the file (without the leading "# " shown here);
# replace "natty" with the appropriate codename if you run a different Ubuntu version
# # wxWidgets/wxPython repository at apt.wxwidgets.org
# deb http://apt.wxwidgets.org/ natty-wx main
# deb-src http://apt.wxwidgets.org/ natty-wx main
sudo apt-get update
sudo apt-get install python-wxgtk2.8 python-wxtools wx2.8-i18n
sudo apt-get install subversion
sudo apt-get install python-chardet
mkdir ~/CWE
cd ~/CWE
svn export http://svn.zhtoolkit.com/ChineseWordExtractor/trunk/
cd trunk
python main.py &



  • 2 weeks later...

Hello c_redman,

First, thanks a lot for your very useful tool. I've downloaded and tried the Windows version, which works flawlessly.

I would like to use it in my never-ending quest for reading material that is slightly above my current level.

To that end, I tried to use the new HSK4 list as a filter, hoping to find texts that contain relatively few words outside that list. However, it turns out to be harder than expected, because quite a lot of very simple words are not included in the HSK4 list, either because they are too simple in themselves (天, 手…) or because they are transparent expressions or compounds (第一次, 跑出去…). So I'm trying to compile a list of such words, but it might take a long time... Any ideas for simplifying this? Maybe a function that would exclude all words whose frequency per million is above x?
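A frequency cutoff of that kind could look roughly like the Python sketch below. It is not the tool's own code: `unknown_words`, the `max_freq` parameter and the frequency numbers are hypothetical, and it assumes you already have a word-to-frequency-per-million mapping from some list you are allowed to use.

```python
def unknown_words(words, known, freq_per_million, max_freq=None):
    """List each word once, in order of first appearance, unless it is
    in the known-word list or (when max_freq is set) more frequent than
    max_freq occurrences per million.

    Words missing from the frequency list are treated as rare (0.0),
    so they are never dropped by the cutoff.
    """
    result = []
    seen = set()
    for word in words:
        if word in known or word in seen:
            continue
        seen.add(word)
        if max_freq is not None and freq_per_million.get(word, 0.0) > max_freq:
            continue  # assumed to be too common to need studying
        result.append(word)
    return result


# Hypothetical segmented text, known-word list and made-up frequencies.
text_words = ["天", "第一次", "葡萄", "天", "护照"]
known_list = {"护照"}
freq_list = {"天": 1500.0, "第一次": 300.0, "葡萄": 12.0}

remaining = unknown_words(text_words, known_list, freq_list, max_freq=100.0)
print(remaining)
```

With the cutoff at 100 per million, common words such as 天 and 第一次 drop out without having to be added to the known-word list by hand. The right cutoff value depends on the frequency list used.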

Also, there is a feature of the online tool that I can't find in the local version: the meanU(log10). Yes, it's only a very rough estimate, but I have tested it on a range of simple, intermediate and difficult texts, and it gives a useful indication of the degree of difficulty. Any plans to incorporate it into the offline version?
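For illustration, here is one possible reading of such a statistic in Python: the mean of log10(frequency per million) over the unique words of a text. That definition is an assumption, since the thread does not spell out how the online tool computes meanU(log10); `mean_log10_unique` and the numbers are made up.

```python
import math


def mean_log10_unique(words, freq_per_million):
    """Mean of log10(frequency per million) over the unique words of a text.

    Assumed definition, not necessarily the online tool's. Words with no
    (or zero) frequency are skipped, because log10 is undefined for them.
    Returns None when no word has a usable frequency.
    """
    unique = {w for w in words if freq_per_million.get(w, 0.0) > 0.0}
    if not unique:
        return None
    return sum(math.log10(freq_per_million[w]) for w in unique) / len(unique)


# Made-up frequencies: log10 values are 4, 3 and 1.
demo_freq = {"的": 10000.0, "天": 1000.0, "葡萄": 10.0}
demo_text = ["的", "天", "的", "葡萄", "未登录词"]

score = mean_log10_unique(demo_text, demo_freq)
print(round(score, 3))
```

Under this reading, a lower mean means the text leans on rarer words and is probably harder. Skipping words with no frequency is a design choice; counting them as very rare instead would push the score down for texts full of names or unusual terms.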

On the other hand, I could actually *read* the text and see for myself whether it's difficult or not :-). Anyway, just a few questions... Thanks again,

Laurenth

