Chinese-Forums

List of word frequency


roddy


c_redman: 不但 . . . . . . . . . . . . 而且

Nice homework. For the English examples, check the CQP box and drop the semicolon at the end of the examples. The interface automatically adds a count limit, and the trailing semicolon causes the query syntax to end prematurely.

xiele,

Jim


  • 2 months later...

Jun Da, on his Chinese Text Computing site, gives a list of bigrams (two-character combinations) in frequency order. However, these were compiled by software and there are some errors. The link is

http://lingua.mtsu.edu/chinese-computing/

If you are trying to improve your Chinese vocabulary, a better approach might be to get the vocabulary lists used in the tests of Chinese proficiency in the PRC (HSK) and in Taiwan (TOP). They are graded into levels such as beginning, intermediate and advanced. The HSK list can be found at

http://en.wiktionary.org/wiki/Category:Chinese_language

and the TOP list can be downloaded under the reference links on

http://www.sc-top.org.tw/english/download.php

Hope this helps your study.


  • 1 year later...

Is there any way to generate such word (not character) frequency lists for individual texts?

I have quite a lot of (Buddhist) documents with rare words and names not appearing in any dictionary. I would like to learn the most common expressions before trying to read those texts.

It's fine if the list contains mistakes (due to low mutual information); I think I would be able to recognize those.

I found some scripts here and there, which are supposed to do this, but could not run them successfully.

- The Stanford Chinese Segmenter, for example, gives the error message "Could not reserve enough space for object heap. Could not create the Java virtual machine."

- SegTag by 史晓东 crashes on launch.

Such a tool would be a great help for all those who are learning non-standard, less frequent parts of Chinese! Well, actually for anybody trying to read a document with unknown words. It would be very useful to find out which words are the most common and sort out the unknown ones for study.


I've done some work with the Stanford Chinese Word segmenter, but I've always been able to fix such errors by allocating a bit more memory. If you're running the segmenter on a Linux system, try this:

/path/to/segmenter/segment.sh ctb unsegmented.txt UTF-8 0 > segmented.txt 2> /dev/null

Your input file should be encoded in UTF-8. If this throws any errors, you can edit segment.sh to increase the amount of memory allocated to Java: something like 2 GB should be more than enough. If that doesn't cut it, you may want to split your input file into a few smaller files.
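
If you do end up splitting the file, a few lines of Python along these lines would do it (just a rough sketch, not something I've actually run; the file names and chunk size are made up):

import io

# Rough sketch: split a big UTF-8 text into chunks of 2,000 lines each so the
# segmenter can process them one at a time. File names and chunk size are made up.
with io.open("unsegmented.txt", encoding="utf-8") as f:
    lines = f.readlines()

chunk_size = 2000
for n, start in enumerate(range(0, len(lines), chunk_size)):
    with io.open("part-%02d.txt" % n, "w", encoding="utf-8") as out:
        out.writelines(lines[start:start + chunk_size])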

The procedure should be the same on Windows, only you'd be looking at segment.bat, I think. But I haven't tried using this tool on Windows.

By the way, you should know that this segmenter will only add word boundaries to your text: it won't calculate these frequency statistics for you. But that's easily solved with a few lines of code, which I'd be happy to write for you if you don't know how. I should still have that code lying around somewhere anyway; I used it to compile this frequency list of the words used in Sina Weibo messages.
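
For what it's worth, the counting step is roughly this (a rough sketch, not the exact code I used; the file names are placeholders). It expects the segmenter's output, i.e. text with spaces between the words:

import io
from collections import Counter

# Count how often each word occurs in the space-segmented text.
counts = Counter()
with io.open("segmented.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

# Write the words to a CSV file, most frequent first.
with io.open("frequencies.csv", "w", encoding="utf-8") as out:
    for word, freq in counts.most_common():
        out.write(u"%s,%d\n" % (word, freq))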

Slightly off-topic, but if you're interested, you can also check out the linguistically annotated corpus of Sina Weibo messages I built :)


Daan, thanks for the answer. Unfortunately, I'm running the script on a Windows netbook with about 1 GB of free RAM. I first tried a whole book (a few hundred kB), then just a short chapter (6 kB), but still couldn't run the segmenter. I have no access to more powerful computers with more memory, so I am a bit lost :-( Is there no other program that needs less RAM? I could wait for days until it's done calculating :-) It could also be a Mac app, but my Mac also has only 1-1.2 GB of free memory.


I'd guess the online version either drops them or lists them as having zero frequency (see the end of the list). It will still list their characters...

What you can do is download the executable file from that website and then add a dictionary specific to the type of material you are reading - it's a very easy process (putting a tab-delimited UTF-8 dictionary text file into a folder).


BertR, sure - just go to the "open access" page and download the words file :)

yaokong, could you attach that file to a post here? I'll see whether I can get it to work with just 1 GB of RAM. It may help if I tweak the settings a bit :)


icebear, that's exactly the point: no dictionary that I know of contains many of the rare expressions and transliterated terms and names (from Sanskrit, Pali, Tibetan etc.) in these documents. That is the main reason I'd like to extract them in the first place. I could then do the research on these when I have the time for it, even add them as dictionary entries to Pleco, and later read the texts (even away from all research possibilities), knowing what these terms mean and recognizing the (sometimes bizarre) transliterations of names.

Daan, thanks! Please try the following (rather large) book (you can cut out a smaller piece of it). Let me know how it went, and whether you have any ideas on how I could do it by myself in the future (on my Windows netbook or my rather old MacBook, both with only 2 GB of memory).

http://www.sendspace.com/file/4d2rpl

or

uploading.com


Hmm. Here's what I did: I converted that file from GB encoding to UTF-8 using Notepad++. I called the converted file "buddhist.utf8.txt" and ran "segment.bat ctb buddhist.utf8.txt UTF-8 0 1> segmented.txt" from the Windows command prompt in a directory that contained both the decompressed word segmenter and the file "buddhist.utf8.txt". I didn't change the default settings, that is, allocating 2GB of RAM to the segmenter. This completed in 38 seconds on my Intel Core i5-2430M (2.4GHz) system, running Windows 7. The segmented text was written to a file called "segmented.txt" in that same directory.

I tried lowering the memory limit to 800 MB, which should work for you. That significantly increased the running time, from 38 seconds to slightly over 5 minutes, but I hope that shouldn't be a problem. Here's how to lower the memory limit: open "segment.bat" in Notepad++ and go to line 51, where it says "java -mx2g ...". Change this into "java -mx800m ..." (and don't touch anything else). Now if you run the same command, the segmenter will never try to use more than 800 MB of memory. You can experiment with other settings, though if you go any lower than 400-500 MB I doubt it will still run. But you'll just get an "out of memory" error, after which you can try again with slightly more memory.

If you want to use the Peking University training data rather than the Chinese Treebank model I used, just replace "ctb" with "pku" in the command. To change the memory allocated to a PKU session, go to line 41 in segment.bat and make the same change.

After the segmentation's done, you'll have a file called "segmented.txt" which will be UTF-8 encoded and which will contain spaces between words, so it should then be easy to calculate word frequency statistics. There seem to be some freeware tools around for this, which I haven't tried - I just ran a little script. I'm attaching a ZIP file containing the segmented text of your book, as well as a CSV file with the word frequencies that my script calculated. It's probably easier for you to try some of these freeware tools than to install the environment you'd need to run that script on Windows, but if you can't find any good tools, let me know and I'll write you a tutorial on how to get my script working :)

By the way, you can also supply additional lists of words to the word segmenter. So if you've looked through this particular frequency list, and determined that some words were correctly segmented, while others weren't, you can create a new text file containing the correct segmentations for these words. For example, you could write

阿弥陀佛

释迦牟尼

and save that to a file called "extrawords.txt". Now if you want to tell the segmenter that in addition to the words it already knows and the heuristics for unknown words, it should also be aware that these words exist, all you need to do is edit "segment.bat" again and look for the same lines where you edited the memory allocation. A bit further down on the same lines, you should see "-serDictionary data/dict-chris6.ser.gz". Change this into "-serDictionary data/dict-chris6.ser.gz,extrawords.txt" and the segmenter should pick this up.

So if you should ever come across a list of Buddhist terminology, such as this one or this one, you can just put these terms in an extra dictionary, which should lead to a considerable increase in accuracy for these words.

Hope this helps a bit!

Buddhist.zip


Here's a text file with all the words from the second dictionary I linked to. There's a link to an XML file containing the dictionary on that web site, but unfortunately that link was dead, so I wrote a quick script to get the words from the HTML file instead. I haven't checked the entire file manually, so there may be some errors, but I think this should do nicely. It's got 16,312 Buddhist terms, so if you use it as an extra dictionary, you should be in good shape. The only problem is that it's in traditional characters and your texts are in simplified characters, so you'll want to convert this file into simplified characters before using it. But I guess there are plenty of tools available online to help you do that :)
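
If you'd rather script that conversion than use an online tool, something like the following should work, assuming you have the OpenCC Python bindings installed (I haven't tested this exact snippet, and depending on the package the configuration name may be "t2s" or "t2s.json"):

import io
from opencc import OpenCC

# Convert the traditional-character word list to simplified characters, line by line.
cc = OpenCC("t2s")
with io.open("buddhistdict.txt", encoding="utf-8") as f, \
        io.open("buddhistdict-simplified.txt", "w", encoding="utf-8") as out:
    for line in f:
        out.write(cc.convert(line))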

The dictionary is available under a Creative Commons license, so redistributing this list here should be fine. Someone even turned it into a custom Pleco dictionary, including Pinyin...you may want to write to him to see if you can get that file, should be useful :)

Happy studies!

buddhistdict.txt


that's exactly the point: no dictionary that I know of contains many rare expressions and transliterated terms and names (from Sanskrit, Pali, Tibetan etc.) in these documents.

I use the tool mentioned before, at http://www.zhtoolkit...rd%20Extractor/, which gives the characters and says "unknown" instead of giving a translation when it doesn't know the character. You still get the frequency.

I first ran it on a segment of your text, and a quick scan didn't reveal any unknowns. Then I ran it on the entire text, which took a while, and it came up with only 13 unknowns, most of them very low frequency. The tool does apply a filter, however: Western script and special characters are not shown. So there is a possibility that rare characters are not recognised as Chinese and filtered out. At first sight, though, it looks like it does the trick.


Silent, that's an interesting tool. I hadn't seen it before, thanks for sharing the link. Unfortunately, the documentation states that:

The method for segmenting the words is simply to find the longest match within the dictionaries loaded into the program. This works well in general, but fails for some character combinations by splitting in the wrong place. It also fails to make words out of terms that aren't in the dictionary, and instead treats these as a series of single characters.

I'm afraid that won't work very well if you're segmenting text with lots of terms that aren't in CC-CEDICT, which is the dictionary this tool uses by default. If the documentation is correct, unknown words will not be recognised by this tool - it'll just split them into single characters, which are indeed likely to be in CC-CEDICT and will thus be considered known. The Stanford word segmenter actively tries to recognise unknown words, and it's also a bit smarter in that it doesn't simply look for the longest match: it considers the entire context when deciding how to segment a bit of text.
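
To make the difference concrete, longest-match segmentation is essentially just the following (a toy sketch, not that tool's actual code):

# Toy illustration of greedy longest-match ("maximum matching") segmentation.
# Anything that isn't in the word list falls through as single characters,
# which is exactly the failure mode described above for unknown terms.
def longest_match_segment(text, words, max_len=8):
    result = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in words:
                result.append(candidate)
                i += length
                break
    return result

# For example, with a word list that only contains 出现:
# longest_match_segment(u"释迦牟尼出现", {u"出现"}) gives [u"释", u"迦", u"牟", u"尼", u"出现"]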

I haven't tried the tool myself, though, so perhaps the docs are outdated :)


Daan, thanks a lot. With your help I was now able to run the Stanford segmenter. I will have to experiment with it, though: for some strange reason, it does not recognize quite a few words, even ones that appear many, many times in the document.

Also, thanks for the link to the Soothill dictionary. I had that already; in fact, I converted the DDB wordlist into another Pleco-friendly dictionary, available on that same forum. I will try the segmenter again, using all these wordlists plus my self-created ones.

Silent, as Daan explained already, we were not talking about unknown characters but about words (bigrams, n-grams, character combinations, that is), such as 明点, which appears often in the document I attached, is not recognized as a word by most programs, and appears in only very few dictionaries. But I have made contact with the developer of that tool, and I will play around with it as well (using additional wordlists, of course).

So, thanks everybody for the help! I will experiment with these tools!


You're welcome. Let us know how you get on - this is bound to be interesting for people trying to read texts containing lots of specialised vocabulary, not necessarily Buddhist words but also, say, tech terms.


  • 3 weeks later...

Not sure if this is exactly what you need, but I prepared a list of n-grams from your text. The attached file contains all n-grams 1 to 8 characters long that appear in the text more than four times. The right column is the number of occurrences. The encoding is UTF-8.

狂密与真密-n-grams.txt


cababunga, well, thank you very much. I think that is the best list I have got so far, and I have tried quite a few approaches. Please be so kind as to let us know how you generated it.

What I would finally like to achieve, and I really think this would be beneficial for many (or even most) learners, is an easy method to produce such document-based word frequency lists and have them automatically compared with one's own character/word learning progress, e.g. based on flashcard testing statistics. This way, when encountering a new document/book/movie (subtitle), one could study the most common (yet unlearned) words before trying to read the given text. I have practically no knowledge of programming, so putting this idea into practice would have to rely on others who do :-)

By the way, the link in your comment's footnote is actually two links; I don't know if that was your intention.


It's really all DIY stuff, and I don't see how I can make it easy to use without spending a lot of time, which I don't have these days. You can still try it if you want.

The attached file is a Python 2 script that does the n-gram extraction. Given a large input, it will produce a series of large files named freq-01.txt, freq-02.txt and so on. The files are unordered and are supposed to be sorted alphabetically and merged, with multiple entries for the same n-gram reduced to one. Then you have to get rid of the infrequent entries and sort the result by n-gram frequency. The tool was originally used for processing a large corpus. For a small input like yours, you will just have one small file, which makes your life much easier: just sort it by frequency and truncate it at some point. On a Unix-like system you can do that by running

sort -rnk2 freq-01.txt | head -20000 >n-gram.txt
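
If you just want the core of the idea without my DIY setup, the counting itself boils down to something like this (a simplified sketch that keeps everything in memory, not the attached script; spilling partial counts to disk, which is what the freq-01.txt files are about, only matters for a really large corpus):

import io
from collections import Counter

# Read the whole text and count every substring of 1 to 8 characters.
with io.open("input.txt", encoding="utf-8") as f:
    text = f.read()

counts = Counter()
for n in range(1, 9):
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if "\n" not in gram and not gram.isspace():
            counts[gram] += 1

# Keep only the n-grams that occur more than four times, most frequent first.
with io.open("n-grams.txt", "w", encoding="utf-8") as out:
    for gram, freq in counts.most_common():
        if freq > 4:
            out.write(u"%s\t%d\n" % (gram, freq))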

