Hmm. Here's what I did: I converted that file from GB encoding to UTF-8 using Notepad++. I called the converted file "buddhist.utf8.txt" and ran "segment.bat ctb buddhist.utf8.txt UTF-8 0 1> segmented.txt" from the Windows command prompt, in a directory that contained both the unpacked word segmenter and "buddhist.utf8.txt". I didn't change the default settings, which allocate 2GB of RAM to the segmenter. This completed in 38 seconds on my Intel Core i5-2430M (2.4GHz) system running Windows 7, and the segmented text was written to a file called "segmented.txt" in that same directory.
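(If I'm reading the segmenter's README correctly, the arguments are: the model to use ("ctb" or "pku"), the input file, its encoding, and the size of the n-best list, where 0 means "just print the single best segmentation". The "1> segmented.txt" part is ordinary output redirection, so you can name that file whatever you like.)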
I tried lowering the memory limit to 800MB, which should work for you. That significantly increased the running time, from 38 seconds to slightly over 5 minutes, but I hope that's not a problem. Here's how to lower the limit: open "segment.bat" in Notepad++ and go to line 51, where it says "java -mx2g ...". Change this to "java -mx800m ..." (and don't touch anything else; "-mx" is Java's maximum-heap-size flag). If you now run the same command, the segmenter will never try to use more than 800MB of memory. You can experiment with other values, though I doubt it will still run if you go much below 400-500MB. Worst case you'll just get an "out of memory" error, after which you can try again with slightly more memory.
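Just so the edit is unambiguous: the relevant bit of line 51 changes from

java -mx2g ...

to

java -mx800m ...

with everything after the memory flag left exactly as it was.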
If you want to use the Peking University model rather than the Chinese Treebank model I used, just replace "ctb" with "pku" in the command. To change the memory allocated for PKU runs, go to line 41 in segment.bat and make the same change as above.
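That is, the full command becomes "segment.bat pku buddhist.utf8.txt UTF-8 0 1> segmented.txt".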
After the segmentation's done, you'll have a file called "segmented.txt", which will be UTF-8 encoded and will contain spaces between words, so it should then be easy to calculate word frequency statistics. There seem to be some freeware tools around for this, which I haven't tried - I just ran a little script. I'm attaching a ZIP file containing the segmented text of your book, as well as a CSV file with the word frequencies my script calculated. It's probably easier for you to try some of those freeware tools than to install the environment you'd need to run my script on Windows, but if you can't find any good ones, let me know and I'll put together a tutorial on how to get the script working.
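(In case you're curious how little is involved: here's a minimal Python 3 sketch of such a script - not verbatim what I ran, and the file names are just examples. It reads the segmented text, counts the words, and writes the frequencies to a CSV file, most frequent first.)

# -*- coding: utf-8 -*-
# Count word frequencies in the segmenter's output (words separated by spaces)
# and write them to a CSV file, most frequent first.
import csv
from collections import Counter

counts = Counter()
with open("segmented.txt", encoding="utf-8") as infile:
    for line in infile:
        counts.update(line.split())  # split() handles spaces and any other whitespace

with open("frequencies.csv", "w", encoding="utf-8", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["word", "count"])
    for word, count in counts.most_common():
        writer.writerow([word, count])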
By the way, you can also supply additional word lists to the segmenter. So if you've looked through this particular frequency list and determined that some words were segmented correctly while others weren't, you can create a new UTF-8 text file containing the correct segmentations for those words, one word per line. For example (using a few arbitrary Buddhist terms as stand-ins for whatever words you actually find), you could write
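菩萨
涅槃
阿罗汉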
and save that to a file called "extrawords.txt". Now if you want to tell the segmenter that, in addition to the words it already knows and its heuristics for unknown words, it should also be aware that these words exist, all you need to do is edit "segment.bat" again and look for the same lines where you edited the memory allocation. A bit further along on those lines you should see "-serDictionary data/dict-chris6.ser.gz". Change it to "-serDictionary data/dict-chris6.ser.gz,extrawords.txt" and the segmenter should pick the new words up.
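(As far as I can tell, that option takes a comma-separated list, so you could chain several extra word lists this way. The path is relative, so keep "extrawords.txt" in the directory you run the command from.)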
So if you ever come across a list of Buddhist terminology, such as this one or this one, you can just put those terms in an extra dictionary, which should considerably increase segmentation accuracy for those words.
Hope this helps a bit!