
List of word frequency


roddy


Small update: catabunga's Python script works really well and is amazingly fast, even on my Windows netbook. The only way I knew to sort the resulting file was with Excel, which was the slowest part of the whole process. Now I wonder how to start using this sorted list for flashcard study. I'd love a way to tell Pleco (or any other flashcard app) to focus on the most frequent words from the top of the list and ignore the items I have already learned (which the app's internal statistics already show). Maybe I'll just cut it into smaller parts, starting with the 100 most frequent words and so on. I'll give it a try and see how it works.
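For the sorting and splitting step, Excel is not strictly needed. Below is a minimal Python sketch, assuming the script's output has one `word<TAB>count` pair per line; the actual output format is not shown in this thread, so adjust the parsing if it differs. The function names `parse_counts` and `study_batches` are made up for illustration.

```python
def parse_counts(lines):
    """Parse 'word<TAB>count' lines into (word, count) pairs.

    Assumed format; lines that do not match are skipped.
    """
    pairs = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not parts[1].strip().isdigit():
            continue
        pairs.append((parts[0], int(parts[1])))
    return pairs


def study_batches(pairs, known=(), size=100):
    """Sort by descending count, drop known words, split into batches."""
    known = set(known)
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    words = [word for word, _ in ranked if word not in known]
    return [words[i:i + size] for i in range(0, len(words), size)]


if __name__ == "__main__":
    # Tiny inline sample standing in for the real frequency file.
    sample = ["的\t100\n", "是\t50\n", "我\t75\n", "not a valid line\n"]
    for number, batch in enumerate(study_batches(parse_counts(sample), known={"我"}, size=1), 1):
        print(number, batch)
```

Each batch could then be written to its own text file, one word per line, and imported into the flashcard app if it accepts plain word lists. The `known` set would have to be exported from the app by hand.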



Thanks for these resources! I am now using SUBTLEX-CH and the two frequency lists available at http://corpus.leeds.ac.uk/list.html to decide which words to study next.

Slightly off-topic, but if you're interested, you can also check out the linguistically annotated corpus of Sina Weibo messages I built :)

Thanks :) I've downloaded it. It looks like some 400,000 lines are missing? I've left a message on your contact form as well.

I'm trying to get the Stanford Segmenter to work. I know almost nothing about this kind of thing, so it's not going smoothly. I copied the files to my X:\Cygwin64__LinuxOnWindows\bin folder and ran the following in Cygwin, but I get these error messages. Editing segment.sh to change the memory requirements for Java, as suggested by Daan above, didn't seem to help.

$ segment.sh ctb test.simp.utf UTF-8
Usage: /usr/bin/segment.sh [ctb|pku] filename encoding kBest
  ctb : use Chinese Treebank segmentation
  pku : Beijing University segmentation
  kBest: print kBest best segmenations; 0 means kBest mode is off.

Example: /usr/bin/segment.sh ctb test.simp.utf8 UTF-8 0
Example: /usr/bin/segment.sh pku test.simp.utf8 UTF-8 0



Another attempt, with a "0" added at the end:



$ segment.sh ctb test.simp.utf UTF-8 0
(CTB):
File: test.simp.utf
Encoding: UTF-8
-------------------------------
Error occurred during initialization of VM
Could not reserve enough space for 2097152KB object heap
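The heap size in that error works out to exactly 2 GiB (2097152 KB / 1024 / 1024). One common cause of this message is a 32-bit Java runtime, which often cannot reserve a heap that large, but that is only a guess from the log. Below is a small sketch of the arithmetic and of lowering the heap flag, assuming segment.sh passes Java a flag of the form `-mx2g` or `-Xmx2g`; the script's contents are not shown here, and `lower_heap_flag` is a hypothetical helper.

```python
import re


def heap_kb_to_gib(kilobytes):
    """Convert the KB figure from the JVM error message to GiB."""
    return kilobytes / (1024 * 1024)


def lower_heap_flag(script_text, new_size="1g"):
    """Replace the size in any -mxN or -XmxN flag (assumed flag form)."""
    return re.sub(
        r"-(X?mx)\d+[kKmMgG]",
        lambda match: f"-{match.group(1)}{new_size}",
        script_text,
    )


if __name__ == "__main__":
    print(heap_kb_to_gib(2097152))
    # Stand-in command line, not the real contents of segment.sh.
    print(lower_heap_flag("java -mx2g -cp example.jar"))
```

If lowering the flag has no visible effect, it may be worth checking that the edited copy of segment.sh is the one actually being run (the usage message above points at /usr/bin/segment.sh).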
