Chinese-Forums

Number of characters per pinyin syllable


tresgog


Hi,

I hope this thread is in the right category.

I would like to find a chart/table with two columns. One would be a given pinyin syllable, say "yi" (for now, without any tone specification); the other would be the number of characters corresponding to that pinyin.

I've already found many online sources where, if you enter a pinyin syllable, they give you a list of characters (sometimes even with the number of characters). I tried to construct this chart myself from that data. The problem is that there are so many different pinyin syllables that it takes forever to finish the chart.

Does anyone know where to find that kind of chart?

Here are some sources:

1) mdbg: http://us.mdbg.net/chindict/chindict.php?page=chardict&cdcanoce=0&cdqchi=yi%0D%0A&cddmtm=2&cddytm=0

For instance, typing "yi" gives me 250 characters.

2) A freeware program called "BabelMap_汉化版". It is a very exhaustive "map" of Chinese characters (maybe not all of them). In this software, if I look up the pinyin "yi", I get 604 characters! But this tool gives redundant answers because it includes both traditional and simplified characters, and I only want a chart (or statistics) for one given character set. Secondly, BabelMap also includes non-Mandarin readings: if I ask for "ng", it gives me the Cantonese "唔" (which is, I think, "wu" in Mandarin pinyin).

Anyway, some of these sources are great, but it takes too long to build the chart from them, which is why I'm looking for a ready-made one.

Thanks for reading

Link to comment
Share on other sites

I can see how you could automate this, but you'd need some database skills (or perhaps just Excel even?).

Get the Unihan database. Throw out everything that doesn't have a GB2312 entry. That cuts it down to 6000+ simplified characters, so we're not going to be swamped. Or there may be a GB2312 specific file somewhere. Then do some kind of search and replace to remove the numbers from YI3, WANG2, etc. Then run a query to find how many of each entry there are in the pronunciation column.

Having exhausted myself with this high-level conceptual work, I shall leave the actual implementation to others.
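
The steps described above (keep only characters with a GB2312 entry, strip the tone numbers, count per syllable) can be sketched in Python roughly as follows. This is only a sketch under assumptions: it expects the older three-column, tab-separated Unihan.txt layout quoted later in this thread, treats the presence of a kGB0 field as "has a GB2312 entry", and counts every listed reading of a character. The function name `count_syllables` and the sample lines are illustrative, and newer Unihan releases may be laid out differently.

```python
import re
from collections import Counter, defaultdict

# Sample lines in the older Unihan.txt layout (code point, field, value,
# tab-separated). The two U+6C49 lines are the ones quoted in this thread.
SAMPLE = """\
U+4E00\tkGB0\t5027
U+4E00\tkMandarin\tYI1
U+6C49\tkGB0\t2626
U+6C49\tkMandarin\tYI4 HAN4
U+6F22\tkMandarin\tHAN4
"""


def count_syllables(lines, charset_field="kGB0"):
    """Count characters per toneless pinyin syllable.

    Only characters that carry `charset_field` are counted. A character
    with several readings is counted once under each reading.
    """
    in_charset = set()            # code points that have the charset field
    readings = defaultdict(set)   # code point -> set of toneless syllables
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue              # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3:
            continue              # not a "code point, field, value" line
        codepoint, field, value = parts
        if field == charset_field:
            in_charset.add(codepoint)
        elif field == "kMandarin":
            for reading in value.split():
                # "YI4" -> "yi": drop the tone digit and lowercase
                readings[codepoint].add(re.sub(r"\d", "", reading).lower())
    counts = Counter()
    for codepoint in in_charset:
        counts.update(readings.get(codepoint, ()))
    return counts


counts = count_syllables(SAMPLE.splitlines())
for syllable, n in counts.most_common():
    print(syllable, n)
```

To run it on the real file, pass an open file object instead of `SAMPLE.splitlines()`. Passing a different `charset_field` would restrict the count to another character set, assuming the file has a matching field.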

Link to comment
Share on other sites

Thanks for the prompt answers. A few questions regarding the previous posts.

1) Is Wenlin freeware? Or at least, can I get this functionality in the free version (if there is one)?

2) How do I edit/manage/sort and "throw out" entries from this Unicode file? (This question is for the others.)

Link to comment
Share on other sites

Thanks for your suggestions.

Although I would prefer a direct answer to my query, I don't mind learning the required database skills to answer my question (I am sure they can be useful in other fields).

If someone knows of the "grail" list/chart, a link would be most welcome.

If such a list does not exist, it will be my honor to generate it. In that case, I need someone to show me the way.

I downloaded the Unihan list. It is very exhaustive: there are some characters which are not even in standard, well-known Chinese dictionaries (especially vulgar terms). I am very impressed.

How can I differentiate simplified from traditional? How is GB2312 coded?

If I take a given Unicode character, it does not seem to have any label mentioning GB2312. For instance, "汉":

U+6C49 kCCCII 224672
U+6C49 kCangjie EE
U+6C49 kCantonese hon3
U+6C49 kDefinition Chinese people; Chinese language
U+6C49 kEACC 274857
U+6C49 kFourCornerCode 3714
U+6C49 kFrequency 3
U+6C49 kGB0 2626
U+6C49 kHKSCS FAE4
U+6C49 kHanYu 31549.080
U+6C49 kHanyuPinlu han4(227)
U+6C49 kIICore 2.1
U+6C49 kIRGHanyuDaZidian 31549.080
U+6C49 kIRGKangXi 0604.091
U+6C49 kIRG_GSource 0-3A3A
U+6C49 kIRG_HSource FAE4
U+6C49 kIRG_TSource F-2166
U+6C49 kKangXi 0604.091
U+6C49 kMainlandTelegraph 3352
U+6C49 kMandarin YI4 HAN4
U+6C49 kMorohashi 99999
U+6C49 kPhonetic 546
U+6C49 kRSKangXi 85.2
U+6C49 kRSUnicode 85.2
U+6C49 kTotalStrokes 5
U+6C49 kTraditionalVariant U+6F22
U+6C49 kXHC1983 0441.040:hàn

I don't see anything that involves "GB2312", apart from the fact that I know it is a PRC character.

Basically, how do I "throw out" the non-GB2312 characters? (Or the non-GB/T 12345 ones, because I am more interested in traditional characters for now.)

NB: What is the number 227 next to han4?

Link to comment
Share on other sites

Hey,

You could use the CC-CEDICT list. If you're on Linux:

wget http://www.mdbg.net/chindict/export/cedict/cedict_1_0_ts_utf-8_mdbg.txt.gz

Gunzip it, then use sed or Vim (or your favorite text editor) to extract just the one-character entries. The commands I used in Vim:

# delete comments
:g/^#/d
# delete multiple-character entries
:g/^[^ ]\{2}/d
# delete left bracket
:%s/\[//
# delete right bracket
:%s/]//
# delete explanation (everything from the first slash on)
:%s/\/.*//
# convert everything to lowercase
:%s/[A-Z]/\L&/g

And then, in a bash-like shell:

awk '{pinyinCount[$3]++} END{for(i in pinyinCount) print i, pinyinCount[i]}' input_file.txt | sort -k2,2nr > pinyin_count.txt

This got me something like the following:

me@you:/tmp$ head pinyin_count.txt
yi4 88
yu4 72
xi1 69
bi4 68
li4 65
yu2 61
ji4 57
zhi4 56
fu2 54
qi2 51
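
For anyone who would rather not do this in an editor, the same pipeline (drop comments, keep one-character entries, lowercase the pinyin, count) can be sketched in Python. This assumes the usual CC-CEDICT line layout of `Traditional Simplified [pin1 yin1] /gloss/`. The function name `count_single_char_readings` and the sample entries are illustrative and not copied from the real file. Like the awk one-liner, it counts dictionary entries, so a character that has two entries with the same reading would be counted twice.

```python
import re
from collections import Counter

# Illustrative entries in CC-CEDICT layout (not taken from the real file).
SAMPLE = """\
# a comment line
一 一 [yi1] /one/
億 亿 [yi4] /100 million/
義 义 [yi4] /justice/
漢 汉 [Han4] /Han ethnic group/
漢語 汉语 [Han4 yu3] /Chinese language/
"""

# traditional, simplified, then the pinyin between square brackets
ENTRY = re.compile(r"^(\S+) (\S+) \[([^\]]*)\]")


def count_single_char_readings(lines):
    """Count one-character entries per pinyin reading (tone number kept)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue                      # comment line
        match = ENTRY.match(line)
        if not match:
            continue                      # not a dictionary entry
        traditional, _simplified, pinyin = match.groups()
        if len(traditional) != 1:
            continue                      # multi-character word
        counts[pinyin.lower()] += 1       # "Han4" -> "han4"
    return counts


counts = count_single_char_readings(SAMPLE.splitlines())
# highest count first, ties broken alphabetically
for reading, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(reading, n)
```

Stripping the trailing digit from each reading before counting would give the toneless chart instead.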

Link to comment
Share on other sites

How do I generate the "pinyin vs count" chart using Wenlin?

A question less related to the topic: how do I update Wenlin? It seems that it is not absolutely complete (even though it seems to be very powerful software). For instance, I cannot find the character whose Unicode code point is U+216A6 (caution: it is vulgar). Some other databases have a definition and other information about this character, e.g. http://www.zdic.net/zd/zi3/ZdicF0ZdicA1Zdic9AZdicA6.htm

(Sorry I picked this one, but it was the only example I found...)

Link to comment
Share on other sites

Thank you cababunga, I only noticed now that you have answered my question.

Could you give the details of how to regenerate the list? That way I can tune some parameters, add some filters (simplified/traditional, frequency, etc.) and add the different tones.

I think this list is intrinsically interesting because it shows, for instance, that the most-used pinyin syllable in Chinese is "yi". Maybe because it is easier to pronounce? Of course, the count of characters for this pinyin includes all the characters in the database, including the ones that are never used.

What would be more interesting is to couple a given pinyin with the frequency of use of the characters. Then we would be able to know what the "most-said" pinyin is (I think it is still "yi").

Anyway, thanks for the answer.

Link to comment
Share on other sites

Generating the list is very simple. I think it's so obvious, I don't even know what to explain. You've seen the structure of the Unihan.txt file. You just need some programming skills to collect the necessary data from it.

Accounting for frequency is not as easy as it seems at first. The Unihan database has a frequency rating for most, but not all, characters. Here is one I just picked at random, 鼋, that doesn't have one. For those that do, there is no information on what these numbers really mean. Are characters with kFrequency 1 ten times more frequent than those with 2, or three times more frequent? Besides, there is often more than one pronunciation for the same character, and, although Mandarin pronunciation variants are said to be sorted by frequency, there are no real numbers you could use for calculations.

A more sensible approach would be to at least also use one of the character frequency lists, which are plentiful on the internet. This will at least give you real character frequencies, but you would still lack the frequency of each pronunciation of a particular character. What would be much better is to derive your data from real word frequencies. Most Chinese words have only one Mandarin pronunciation, with the exception of some single-character words. This would give you something you can rely on, but unfortunately it's too much work just to satisfy curiosity.
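
To make the character-frequency idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the frequency numbers are invented placeholders, not taken from any real list, and the names `strip_tone` and `weighted_syllable_counts` are made up. It credits each character's whole frequency to its first-listed reading, which is exactly the simplification described above, since per-reading frequencies are not available.

```python
from collections import Counter

# INVENTED placeholder frequencies (occurrences in some imaginary corpus).
# Replace with a real character frequency list before drawing conclusions.
char_freq = {"的": 200, "一": 100, "以": 40, "汉": 5}

# Readings per character, most common reading first (illustrative).
char_readings = {
    "的": ["de5", "di2", "di4"],
    "一": ["yi1"],
    "以": ["yi3"],
    "汉": ["han4"],
}


def strip_tone(reading):
    """'yi4' -> 'yi': drop a trailing tone digit, if any."""
    return reading.rstrip("012345")


def weighted_syllable_counts(char_freq, char_readings):
    """Sum character frequencies per toneless syllable.

    Each character's whole frequency goes to its first-listed reading,
    so less common readings (e.g. 'di' for 的) get no credit at all.
    """
    totals = Counter()
    for char, freq in char_freq.items():
        readings = char_readings.get(char)
        if not readings:
            continue                  # no reading known for this character
        totals[strip_tone(readings[0])] += freq
    return totals


totals = weighted_syllable_counts(char_freq, char_readings)
for syllable, weight in totals.most_common():
    print(syllable, weight)
```

Working from word frequencies, as suggested, would avoid the first-reading shortcut, because most words have a single pronunciation.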

OK, here is something for you to play with.

This one is the same as before, but accounting for tones and restricted to the GB2312 character set only (the previous one was run for all GB extensions as well):

http://mandarinspot.com/static/pinyin-count-with-tones-gb2312.txt

This one is the same, but for the Big5 character set:

http://mandarinspot.com/static/pinyin-count-with-tones-big5.txt

Link to comment
Share on other sites

Writing to say thanks for the lists too. I couldn't agree more that making those lists is simple and obvious, but it would nevertheless have taken me several hours, let's say, to remember or figure out the way to actually implement the solution myself (shameful as it is to admit that). As someone who is seemingly studying Chinese language as much as a general subject of curiosity as to become proficient at using it, those lists are perfect for me. I have casually wondered about distribution of characters to syllables for some time.

约翰好

Link to comment
Share on other sites

I think this topic is nearly closed, but I still would not be able to sort the Unihan txt file. The main reason is that I lack some "programming skills". I have used C before to do some easy physics computations (simulations and so on), but I wouldn't be able to sort and do statistics on a data file.

Does anyone happen to know a basic tutorial for this kind of sorting?

cababunga, can you provide the code you used?

Thank you again.

Link to comment
Share on other sites
