Character vs Word Frequency Comparison (Updated)

December 28, 2012 at 03:39 PM

Several years ago, when I encountered one of those nice, smooth curves that claimed that if you know x characters, you’ll understand y% of a given text, I first was delighted. So, just finding those characters for an acceptable level would minimize the effort of learning to understand a text. Somebody must have done that already.

But soon the question marks lined up.

My first problem was, how did they measure “understanding”? Only recently, I was told that they didn’t. The reasoning gets clearer when you instead use words like the OP’s “recognition”.

Other issues have already been mentioned in this thread, for example the relationship between characters and words. Doctoral theses have been written on how to define or identify a word, but there seems to be no easy solution. You’ll all be aware of the possible problems when encountering a unit of text consisting of, say, just the four characters ABCD. There might exist all the “words” A, B, C, D, AB, BC, CD, ABC, BCD, and ABCD. If you recognize the four separate characters, that’s no guarantee that you can split ABCD correctly into words or even that you have learned the emerging words. (Going for AB + CD seems to work best in a majority of cases, but that’s statistics, not understanding.)
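To make the ABCD point concrete, here is a minimal sketch that lists every way of splitting a string into entries of a word list. The lexicon is hypothetical (just the ten "words" named above) and the function name is made up for illustration; real segmenters are far more involved.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into consecutive entries of `lexicon`."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this head and split the remainder in every possible way.
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Hypothetical lexicon: the ten "words" from the ABCD example.
LEXICON = {"A", "B", "C", "D", "AB", "BC", "CD", "ABC", "BCD", "ABCD"}

splits = segmentations("ABCD", LEXICON)
for split in splits:
    print(" + ".join(split))
```

With all ten entries in the lexicon, four known characters give eight candidate readings, and nothing in the characters alone tells you which reading is the right one.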

And there’s grammar. The classic is of course 我是下午买的票. You can’t translate “I am a ticket bought this afternoon” to “It was in the afternoon that I bought the tickets” from only knowing the characters/words.

I wish that I could have written a clever conclusion to this rant, but I’ll just end by hoping that some clever person some day in my lifetime will merge statistics and corpus linguistics in a way that increases learning efficiency.

December 31, 2012 at 07:46 AM

I wish that I could have written a clever conclusion to this rant, but I’ll just end by hoping that some clever person some day in my lifetime will merge statistics and corpus linguistics in a way that increases learning efficiency.

I don't know what you have in mind, but I'd say don't hold your breath for it. General statistics and corpus analyses are imho dead-end streets when it comes to learning efficiency. The vocabulary you come across is far too personal; a general corpus is at best a rough approximation. At the absolute beginner level they may not be too bad, since basic structural vocabulary is needed by everyone. But even then, with only structural particles you're not able to build sentences, and a traveler might prioritize vocabulary to ask directions, buy tickets and order food, where someone learning for business reasons might prefer vocabulary to introduce himself, make appointments etc. I'd say for vocabulary, learn the vocabulary you actually come across. E.g. I created my own corpus by choosing a couple of books I want to read, first sorting the words by frequency and then learning the topmost occurrences in order of first appearance. At a higher level you might choose to learn all the words you have to look up or so, but be sure to make it personal.
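A rough sketch of that personal-corpus idea, not the poster's actual script: count the words, keep the most frequent ones, then study them in the order they first turn up in the text. It assumes the books have already been segmented into a list of words, which is its own problem; the function name and the sample tokens are invented for illustration.

```python
from collections import Counter


def study_list(tokens, top_n):
    """Pick the `top_n` most frequent tokens, ordered by first appearance."""
    counts = Counter(tokens)

    # Remember where each token first shows up in the text.
    first_seen = {}
    for position, token in enumerate(tokens):
        first_seen.setdefault(token, position)

    # Most frequent first; ties go to whichever token appears earlier.
    by_frequency = sorted(counts, key=lambda w: (-counts[w], first_seen[w]))

    # Study the selected tokens in reading order.
    return sorted(by_frequency[:top_n], key=lambda w: first_seen[w])


# Hypothetical, already-segmented sample text.
tokens = ["票", "我", "买", "我", "票", "我", "下午", "买", "我"]
plan = study_list(tokens, top_n=3)
print(plan)
```

Sorting the selection by first appearance means the list follows the book, so each word is learned shortly before it is needed.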

January 4, 2013 at 02:40 PM

I wish that I could have written a clever conclusion to this rant, but I’ll just end by hoping that some clever person some day in my lifetime will merge statistics and corpus linguistics in a way that increases learning efficiency.
If it's just about optimizing your chances of knowing the words in a randomly chosen text, the best way from a probabilistic standpoint is to start learning all the words in the language (or corpus, or the text in question) in order of their word frequency. But that's very boring and I don't know of anybody who studies that way. Plus, the long tail of rare words is *very* long, and the gains made per word get exponentially (logarithmically?) small. Ultimately, there's no getting around the fact that everyone has to learn a lot of words to be fluent, and there's no magic shortcut.
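On the "exponentially (logarithmically?)" question, a back-of-the-envelope sketch may help. It assumes word frequencies follow an idealized Zipf distribution (frequency proportional to 1/rank), which real corpora only approximate, and the vocabulary size is a made-up figure. Under that assumption, coverage grows roughly with the logarithm of the number of words known, so the gain from the k-th word shrinks like 1/k rather than exponentially.

```python
def zipf_coverage(known, vocabulary_size):
    """Share of running text covered by the `known` most frequent words,
    ASSUMING frequency is proportional to 1/rank (idealized Zipf)."""
    total = sum(1.0 / rank for rank in range(1, vocabulary_size + 1))
    covered = sum(1.0 / rank for rank in range(1, known + 1))
    return covered / total


VOCABULARY_SIZE = 50_000  # hypothetical size, for illustration only

for known in (100, 1_000, 10_000, 50_000):
    print(known, round(zipf_coverage(known, VOCABULARY_SIZE), 3))
```

In this toy model, going from 100 to 1,000 words adds about the same share of coverage as going from 1,000 to 10,000, which is the long tail in numbers: ten times the effort for the same gain.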

If you mean learning a lot of words but also some amount of rare characters as leverage in guessing at a larger number of rare words, that's an interesting question. As an obvious example, I wish I could go back and learn all the phonetic characters used in foreign transliterations as a single set, as it would have saved me from learning characters like 迪, 萨, 霍, etc. with their Chinese character meanings (and not realizing their primary usage as phonetic) in a piecemeal fashion.

April 27, 2013 at 05:25 AM

This is an interesting topic. From the first chart,

3000
Characters: you will recognize 99.242% of all text

Words: you will recognize 76.993% of all text

I highly doubt I know 3000 characters even though I can read most Chinese text without a problem o.O How was it determined? Like, what kind of text is this referring to (like, if you compare a casual diary to a scientific paper)?

April 27, 2013 at 08:58 AM

How was it determined?

The OP mentions his methodology in the first post.

April 27, 2013 at 01:16 PM

I had a feeling you were using the word incorrectly. Radicals are simply the 214 (give or take, depending on which system you're talking about) components used as section headings in dictionaries. Hence the Chinese term 部首, which literally means section heading. You're talking about character components, not radicals. Components can be semantic, phonetic, pictographic, etc., and understanding how those work can indeed help with learning characters. Knowing radicals can only help you make a decent guess as to where to find the character in a dictionary.
I can also guarantee that the radicals aren't there to "give you hints about the meaning and pronunciation." It's an artificial system created as a way of indexing characters in a dictionary long after the writing system had already developed, by people who didn't fully understand how the writing system developed (for instance, the 說文解字 is wrong most of the time, I've seen claims of up to 90%), not something that is all that useful for explaining how characters actually work.

Might want to go back and do some actual research before you look the fool. I take it you're not familiar with phonetic determinant characters.

And no, they don't only help you make a guess, they help substantially with the way you learn the characters. Since radicals are written the same way regardless of character it makes it easier to learn to write the character. The memorization process is easier as well because rather than learning all those strokes you just have to learn the radicals that make up the character.

And yes there are non-radical components as well, but the ones I've seen are only such because nobody felt the need to declare them to be radicals, treating them as radicals is not going to cause any harm other than the inability to use them to look up words in a dictionary. And even that's moot as the traditional Chinese organization system doesn't work very well.

As far as giving hints, the system of characters has been evolving for a good long time and it seems rather a matter of semantics to argue about the causality of design decisions by a group of people that aren't available to explain. Sort of like talking about the tendency of some languages to have root words, suffixes and prefixes, it's a very similar situation IMHO, and there because having to learn things that aren't systemized is substantially harder.

At any rate, the OP's numbers really shouldn't have been posted, as they give an extremely inaccurate sense of how little is needed to progress, when it's at best true if you're controlling the inputs. In real life I observed nowhere near that level of recognition or comprehension.

April 27, 2013 at 03:31 PM

I don't think the name-calling is necessary. Bring it back down a notch.

I suggest you go back and read what I said again, and do your research. When we get on the same page as far as using accurate terminology, then we can talk. I mentioned some books earlier in this thread that would make a good starting point.

April 27, 2013 at 05:39 PM

@imron, oh I didn't realize there is a link to a frequency list on the site. I was using the search bar on the top of the page so I thought OP had to manually enter certain characters LOL.

Thanks

April 27, 2013 at 06:10 PM

OneEye, there was no name calling in that post. Anyways, I'm out of here, if you're going to be ridiculous, I have better things to do with my time.

Bottom line is that if you can't be bothered to read what I've posted, then there's no point in me posting anything at all. I was willing to grant the OP the benefit of the doubt, but ultimately I have yet to see anything that suggests that this applies to texts that aren't specially prepared for the purpose of learning to read. Without word breaks or any indication of how the clusters of characters are broken up, you need a much larger number of characters to achieve those percentages without looking up all those characters in a dictionary.

April 27, 2013 at 06:36 PM

before you look the fool
if you're going to be ridiculous

no name calling

Huh?

One of us is trying to have a reasonable conversation. The other is being pretty rude.

April 27, 2013 at 06:59 PM

OneEye, quit wasting my time. You accused me of calling you names and you can't even present any evidence that I've called you a name. Not even one.

I warned you to be more mindful or you were going to look foolish. Now, you've gone and done it. You look quite foolish making accusations that you can't even back.

Anyways, I'm out of here. Don't bother responding to me anymore in here as I won't be responding. You might also consider apologizing, but I don't expect that somebody such as yourself would ever bother.

April 27, 2013 at 07:12 PM

Ah, I should have learned long ago not to feed the trolls.

You had me going there for a minute, nice job. Jumped the shark a bit with that last post though.

February 8, 2014 at 06:08 PM

Thanks howlingfantods, could you share the files (cf. your 1st post) again?

Thanks!

T.

February 9, 2014 at 01:29 AM

I've found this convincing resource here: link. This is a ranking of characters and words (1+ characters) by frequency of use, in a corpus of films and television subtitles (46.8 million characters, 33.5 million words).
This is the most exhaustive list I've come by so far.

Posts above, in order, are by: Lugubert, Silent, c_redman, LinBB, imron, hedwards, OneEye, LinBB, hedwards, OneEye, hedwards, OneEye, Tatarik, Tatarik.