List of word frequency

June 17, 2012 at 03:46 AM

Small update: catabunga's Python script works really well and is amazingly fast, even on my Windows netbook. The only way I knew to sort the resulting file was with Excel, which was the slowest part of it all. Now I wonder how to start using this sorted list for flashcard study. I'd love a way to tell Pleco (or any other flashcard app) to focus on the most frequent words from the top of the list while ignoring the items I have already learned (which the app's internal statistics show anyway). Maybe I'll just cut the list into smaller parts, starting with the 100 most frequent words and so on. I'll give it a try and see how it works.

August 22, 2013 at 04:09 PM

Old topic, but I saw this frequency list on the web. It's a word and character frequency list "based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words)".

Seems like a smart way to compile a list, noting the slant of the corpus.

January 2, 2016 at 05:57 PM

Thanks for these resources! I am now using SUBTLEX-CH and the two frequency lists available at http://corpus.leeds.ac.uk/list.html to decide which words to study next.
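
One possible way to combine several frequency lists into a single study order is to average each word's rank across the lists. The sketch below is only an illustration of that idea, not how any of these resources are meant to be used. It assumes each list is already sorted from most to least frequent, and it gives a word missing from a list the rank just past that list's end; both the function name and the sample words are hypothetical.

```python
def combined_ranking(lists: list[list[str]]) -> list[str]:
    """Order words by their mean rank across several frequency lists.

    Each input list is assumed to be sorted from most to least frequent.
    A word absent from a list is treated as ranked just after its last entry.
    """
    # Map word -> 1-based rank for each list, so lookups are fast.
    rank_maps = [{word: i + 1 for i, word in enumerate(lst)} for lst in lists]
    words = set().union(*lists)

    def mean_rank(word: str) -> float:
        total = 0
        for lst, ranks in zip(lists, rank_maps):
            total += ranks.get(word, len(lst) + 1)
        return total / len(lists)

    # Ties are broken by the word itself so the result is deterministic.
    return sorted(words, key=lambda w: (mean_rank(w), w))


# Made-up sample lists standing in for two real frequency lists.
subtitles = ["的", "我", "你", "是"]
web = ["的", "是", "在", "我"]
print(combined_ranking([subtitles, web]))  # ['的', '我', '是', '你', '在']
```

Mean rank is a crude measure: it ignores the actual counts and the size of each corpus, so it is best treated as a rough guide for choosing what to study next.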

Slightly off-topic, but if you're interested, you can also check out the linguistically annotated corpus of Sina Weibo messages I built.

Thanks, I've downloaded it. It looks like some 400,000 lines are missing? I've left a message on your contact form as well.

I'm trying to make the Stanford Segmenter work. I know almost nothing about this kind of thing, so it's not going smoothly. I've tried copying the files to my X:\Cygwin64__LinuxOnWindows\bin folder and running this in Cygwin, but I get the error messages below. Editing segment.sh to change the memory requirements for Java, as suggested by Daan above, didn't seem to work.

$ segment.sh ctb test.simp.utf UTF-8
Usage: /usr/bin/segment.sh [ctb|pku] filename encoding kBest
  ctb : use Chinese Treebank segmentation
  pku : Beijing University segmentation
  kBest: print kBest best segmenations; 0 means kBest mode is off.

Example: /usr/bin/segment.sh ctb test.simp.utf8 UTF-8 0
Example: /usr/bin/segment.sh pku test.simp.utf8 UTF-8 0



Another attempt, with a "0" added at the end as the kBest argument:

$ segment.sh ctb test.simp.utf UTF-8 0
(CTB):
File: test.simp.utf
Encoding: UTF-8
-------------------------------
Error occurred during initialization of VM
Could not reserve enough space for 2097152KB object heap

yaokong

Johnny20270

Rowley