Chinese-forums.com
Learn Chinese in China

Character Frequency Analysis and Learning Site


sunlightatdusk

I've put together another frequency list of Chinese characters, drawing on the usual 100,000 Chinese web pages. This one has a bit of a twist: it first converted the characters into Pinyin, so that specific character-pinyin pairs could then be tallied.

http://readmandarin.com/research.htm
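The character-pinyin tally described above could look roughly like the following. This is only a minimal sketch, not the site's actual code: the `annotated` list and the `tally_pairs` helper are hypothetical, and it assumes the text has already been annotated with one reading per character (the thread does not say how the readings were chosen).

```python
from collections import Counter

# Hypothetical input: text already annotated with one reading per character.
# How the site picked each reading is not described in the thread.
annotated = [
    ("我", "wǒ"), ("的", "de"), ("目", "mù"), ("的", "dì"),
    ("是", "shì"), ("你", "nǐ"), ("的", "de"),
]

def tally_pairs(pairs):
    """Count each (character, pinyin) pair separately."""
    return Counter(pairs)

# Print the pairs from most to least frequent.
for (char, reading), n in tally_pairs(annotated).most_common():
    print(char, reading, n)
```

Counting pairs rather than bare characters means that 的 read as "de" and 的 read as "dì" end up as separate rows in the list.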

If you dig around the site, you'll see it's aimed at teaching people to read Chinese.

Any feedback on the research or on the site itself would be greatly appreciated.

Also, can anyone explain to me the etymology behind 的? White spoon? I've looked in a number of books and haven't gotten anywhere, and I'd like to update it for the site.

Thank you!

Daniel


mr.stinky

a bit confusing with the line beneath each character, but i suppose i could get used to it.

what is the relation between # seen and cumulative %?

"Sentences, phrases and paragraphs of over 50 characters were included."

can i assume only paragraphs required a minimum 50-character count?

what percentage of text was discarded due to too-short paragraphs?

"Duplicate phrases were discarded."

did phrases have to be an exact match, or of a certain length to qualify?

how many times was "build a harmonious countryside" discarded?

sunlightatdusk

Toffeeliz:

I wasn't quite sure what you meant by the results being off. Please explain if you have a moment.

mr.stinky:

I've removed the lines under the characters so the page is easier to read.

Cumulative percentage gives you an idea as to how much of the text you would be able to recognize if you learned only a certain number of characters. The top 10 characters make up about 12.2% of the sampled text. The top 1000 characters make up about 90.3% of the sampled text.
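For anyone who wants to reproduce this kind of column, here is a minimal sketch of how a cumulative percentage can be derived from raw counts. The function name `cumulative_coverage` is made up for illustration, and restricting the count to the basic CJK Unified Ideographs block is an assumption, not something stated about the original analysis.

```python
from collections import Counter

def cumulative_coverage(text):
    """Return (char, count, cumulative %) rows, most frequent first."""
    # Assumption: only characters in the basic CJK block are counted.
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    rows = []
    running = 0
    for ch, n in counts.most_common():
        running += n  # occurrences covered by this character and all above it
        rows.append((ch, n, 100.0 * running / total))
    return rows

sample = "我的书是我的"
for ch, n, pct in cumulative_coverage(sample):
    print(ch, n, f"{pct:.1f}%")
```

Each row's percentage is the share of the sampled text covered by that character together with every more frequent one, so the last row always reaches 100%.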

Yes, paragraphs had to be 50 characters in length. The idea was that the program would take in text from actual articles, reviews and blogs. I'm not sure what percentage of the shorter text was discarded, but I suppose I could run the analysis and find out.

It was only duplicate paragraphs that were omitted. I've changed the description to reflect that.
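The two filtering rules (a minimum paragraph length and dropping duplicate paragraphs) can be sketched as below. This is an illustration under stated assumptions rather than the program actually used: the helper names are hypothetical, the threshold is read as "at least 50 characters", and duplicates are taken to mean exact matches after trimming whitespace.

```python
def filter_paragraphs(paragraphs, min_chars=50):
    """Keep paragraphs of at least min_chars characters, dropping exact repeats."""
    seen = set()
    kept = []
    for p in paragraphs:
        p = p.strip()
        if len(p) < min_chars:
            continue  # too short to count as article-like text
        if p in seen:
            continue  # exact duplicate of a paragraph already kept
        seen.add(p)
        kept.append(p)
    return kept

def discarded_share(paragraphs, min_chars=50):
    """Percentage of all characters that sit in too-short paragraphs."""
    lengths = [len(p.strip()) for p in paragraphs]
    total = sum(lengths)
    short = sum(n for n in lengths if n < min_chars)
    return 100.0 * short / total if total else 0.0
```

Running `discarded_share` over the raw paragraphs would give one way to answer the question about how much text was thrown away for being too short.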

Thanks for the great feedback!

Daniel

trevelyan

If you went through the work of discarding the HTML markup and the shorter sentences, the raw datafiles could be useful to others doing text processing. I'd be curious to look at them for word-level rather than character-level frequencies myself.

sunlightatdusk

trevelyan:

That's a cool idea. Would you be looking for words of 2-3 characters in length? How would you determine where one word begins and the other ends? Or would you just look for adjacent characters that come up often?
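The last option, looking for adjacent characters that come up often, is the easiest to try. A minimal sketch, assuming the cleaned text is available as a list of paragraph strings (the function names are hypothetical, and this is a crude stand-in for real word segmentation, which normally needs a dictionary or a statistical segmenter):

```python
from collections import Counter

def is_hanzi(ch):
    # Assumption: only the basic CJK Unified Ideographs block is considered.
    return "\u4e00" <= ch <= "\u9fff"

def bigram_counts(paragraphs):
    """Count every pair of adjacent Chinese characters."""
    counts = Counter()
    for p in paragraphs:
        for a, b in zip(p, p[1:]):
            # Skip pairs that straddle punctuation, digits or Latin letters.
            if is_hanzi(a) and is_hanzi(b):
                counts[a + b] += 1
    return counts

sample = ["我们学习中文,我们喜欢中文"]
for pair, n in bigram_counts(sample).most_common(3):
    print(pair, n)
```

Frequent pairs are only candidates for words: a pair such as 习中 in the sample above spans two words, so raw adjacency counts would still need filtering.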

If you message me your email address, I'll send the text to you as an attachment.

Daniel

  • 2 weeks later...

Daniel:

Also, can anyone explain to me the etymology behind 的? White spoon?

"的 ...  

勺 ladle (→ pour out contents; make contents evident) + 白 (phon) white (→ color that stands out) → target; mark; objective → obvious; hit; -like."

Source: http://www.kanjinetworks.com/

hope it inspires,

Ole

  • 1 month later...
self-taught-mba
勺 ladle (→ pour out contents; make contents evident) + 白 (phon) white (→ color that stands out) → target; mark; objective → obvious; hit; -like."

I've read the opposite: the spoon was the phonetic and the white was the meaning, at least according to the ABC dictionary.

But I'm not a character guy.


I copied and pasted the first 3000 characters into an Excel spreadsheet; it weighed in at 415k, which isn't too bad. Two things that this list lacks, for my needs at least:

1. traditional characters

2. additional readings

where applicable


勺 ladle (→ pour out contents; make contents evident) + 白 (phon) white (→ color that stands out) → target; mark; objective → obvious; hit; -like."

I've read the opposite: the spoon was the phonetic and the white was the meaning.

See here for more detail regarding the origin of 的.

:) It's very interesting. Now I know that I know more than 1000 characters; I had almost no problems in the first 1000. :( I used to think that with 1000 characters I could speak very fluently, but now I see that 1000 is too few.
