Chinese-Forums
What's the Best Way to Estimate Your Vocabulary Size?


Nathan Mao

I used to pull out my little red dictionary, pick 10 pages at random, count the character entries on each page whose meaning I knew, average those counts, and multiply by the number of pages in the dictionary.  I would cross-check by picking 2-3 pages from the character reference section at the beginning of the dictionary, averaging those pages, and scaling that average up by the number of pages in the reference section.

Most of the time, those methods came out pretty close, and consistent with my ability to read newspapers without a dictionary.
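For anyone who wants to redo the page-sampling arithmetic without the mental math, here is a minimal sketch in Python. The function name `estimate_from_pages`, the per-page counts, and the 600-page total are all made up for illustration; the standard error is a rough one that treats the sampled pages as a simple random sample.

```python
from statistics import mean, stdev

def estimate_from_pages(known_per_page, total_pages):
    """Scale the average number of known entries per sampled page
    up to the whole dictionary, with a rough standard error."""
    avg = mean(known_per_page)
    estimate = avg * total_pages
    n = len(known_per_page)
    # Standard error of the extrapolated total (no finite-population correction).
    se = stdev(known_per_page) / n ** 0.5 * total_pages if n > 1 else 0.0
    return estimate, se

# Hypothetical counts of known entries on 10 randomly chosen pages.
counts = [4, 6, 5, 3, 7, 5, 4, 6, 5, 5]
estimate, se = estimate_from_pages(counts, total_pages=600)
print(f"about {estimate:.0f} entries, give or take {se:.0f}")
```

With only 10 pages the give-or-take figure is wide, which is one reason a second cross-check sample is useful.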

 

But I don't carry around my little red dictionary anymore.

I took one online test, but it maxed out at 500 characters or so.

 

I've also gone through a list of the most common characters and looked for the point where, out of any given 20 characters, I stopped recognizing more than half, and took that position in the list as a rough estimate of my total.
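The cutoff idea can be sketched like this, assuming you have already marked each character in a frequency-ordered list as recognized or not. `cutoff_estimate` is a hypothetical helper, and the data below is invented.

```python
def cutoff_estimate(recognized, window=20):
    """recognized: list of bools in frequency-rank order (most common first).
    Returns the rank at which the first window of `window` characters
    has half or fewer recognized; that rank is the rough estimate."""
    for start in range(0, len(recognized), window):
        chunk = recognized[start:start + window]
        if sum(chunk) * 2 <= len(chunk):
            return start
    # Never dropped to half or below: the list was too short to find a cutoff.
    return len(recognized)

# Hypothetical marks: first 40 all known, then every other one, then none.
marks = [True] * 40 + [True, False] * 10 + [False] * 20
print(cutoff_estimate(marks))
```

Note that this ignores the characters recognized past the cutoff and the ones missed before it, so it assumes those two roughly cancel out.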

 

Clearly, these methods aren't all that accurate.  I could be off by as much as 200 characters.

 

...not that it matters, but it is nice to know, right?

 

Any suggestions?


Yeah, this is a tough problem, especially for those of us who picked up a lot of our words based on real-world need rather than some sort of rigorous curriculum.

 

The statistician in me suggests taking that list you've got and randomly selecting a certain number of characters to test yourself on. There are formulas out there, and you can get 95% confidence in the result if you choose an appropriate sample size. I'm too lazy to pull up the actual formula at present, but the formulas are online somewhere.
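As a rough illustration of the kind of formula meant here, the sketch below uses the standard survey-sampling sample-size formula for a proportion (normal approximation, with a finite-population correction). It is a general-purpose formula, not one designed for vocabulary, and it assumes a simple random sample from a fixed list; the function names and the 5000-character list size are assumptions for the example.

```python
import math

Z95 = 1.96  # z-score for a 95% confidence level

def sample_size(margin, population, p=0.5, z=Z95):
    """Characters to test for a given margin of error on the proportion known.
    p=0.5 is the worst case when you have no prior guess."""
    n0 = z * z * p * (1 - p) / (margin * margin)
    # Finite-population correction: the list only has `population` characters.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

def estimate_known(sample_results, population, z=Z95):
    """sample_results: bools for a simple random sample from the list.
    Returns (estimated characters known, half-width of the 95% interval)."""
    n = len(sample_results)
    p_hat = sum(sample_results) / n
    fpc = math.sqrt((population - n) / (population - 1))
    half = z * math.sqrt(p_hat * (1 - p_hat) / n) * fpc
    return p_hat * population, half * population

print(sample_size(0.05, 5000))  # sample needed for +/- 5 percentage points
est, half = estimate_known([True] * 60 + [False] * 40, 5000)
print(f"{est:.0f} +/- {half:.0f} characters")
```

The interval only covers how many characters on that particular list you know; it says nothing about characters outside the list.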


@Demonic_Duck as far as I know the Chinese don't have the kind of clear distinction between word and character that you might expect. It took me about 6 months to figure that out. Hence the multiple terms for dictionary, 词典 and 字典. They're not quite the same in Chinese, but if you want to do something with them that requires working with individual words, you have to explain what a word is. I think part of the confusion comes from particles, things that are sort of like words, but sort of not.

 

Feel free to enlighten me on the matter; I still find the situation somewhat confusing.


"字" is character, "词" is word. There's a clear distinction, but they may get used somewhat interchangeably in non-technical contexts.

 

The word for "vocabulary" (as in, the number of words one knows) in Chinese is "词汇量". I'm not sure if "词汇量" is ever used for the equivalent concept applied to characters.


I'd say that since Chinese is structured differently than English, there is no equivalent to the 词字 delineation, nor should there be.

Just one of those concepts that can't be directly translated.


Give this a shot - I found it fairly accurate the few times I tried in the past. http://www.zhtoolkit.../apps/wordtest/

 

Accurate according to what? Something else that you trust even more? Do tell.

 

I tried the zhtoolkit tests, and while I can't fault the methodology, they said I knew 1400 characters and 14,000 words. I would have put myself a few hundred higher on characters and a few thousand fewer on words, but like I said, the methodology sounds pretty reasonable.


If you're looking to estimate the number of characters you know, the best approach is to get a list of the top 5000 and count. It will take half an hour, but you'll know for sure. Every test is going to be off by quite a large amount, sometimes by an order of magnitude.

Estimating actual vocabulary (words, not characters) is a lot more difficult. AFAIK, linguists typically use the dictionary method you described.

On an unrelated note, I think that the 词 / 字 distinction is not helpful in this particular case, because a 词 is typically assumed to consist of more than one 字 (whether this is correct, I don't know). I prefer to refer to vocabulary items which can be used alone as "words", regardless of how many characters they contain.

I would count characters to estimate literacy, and "words" to estimate vocabulary.


The statistician in me suggests taking that list you've got and randomly selecting a certain number of characters to test yourself on. There are formulas out there, and you can get 95% confidence in the result if you choose an appropriate sample size. I'm too lazy to pull up the actual formula at present, but the formulas are online somewhere.

I'd be interested to see the formula and how it was derived. Is this specifically created for estimating vocabulary size, or a general method used for something else (e.g. opinion polls)?

Very often, such formulae are based on very strong assumptions, which never hold for language. For example, if you assume that characters are i.i.d., you will be off by at least an order of magnitude.

To estimate vocabulary, you need to know the exact frequency of each individual character you are testing, and that depends very much on the corpus (once you're above the most common 100 or so). Some online estimators do incorporate all this information, and they are still woefully off in practice.


I like the Nciku character test. Its method of estimation is described on the site. Perhaps everyone can try the different sites and report back the test results from each.

http://blog.nciku.com/blog/en/?p=2132

The test starts with the most simple common hanzi, but quickly moves on to harder characters as the test goes on – depending on your Chinese level, you’ll be asked between 10 and 60 questions. Although there are tens of thousands of Chinese characters, this test only includes the most common 5000, which is about the same number that a typical educated Chinese person would know.


@Nathan, that's ultimately my perception there. At present there are no breaks between words in any printed materials that I've seen, except where there is punctuation. So, for the typical Chinese person, I see no reason why they would naturally think in terms of words the way that we do in English. And in my experience, they often don't.

 

I don't really see this as right or wrong, but it is something that has to be accounted for.

 

@Renzhe, I guess it's not so much a formula as a representative sample designed to provide a level of certainty. AFAIK, that's how those estimation programs work. They'll typically take a certain number of high-frequency words, lower-frequency words and low-frequency words, and then, based on the rate at which you answer correctly, give you an estimate of how many characters you're likely to know.
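A minimal sketch of that idea: split a frequency list into bands, test a few items from each band, and scale each band's hit rate by the band's size. This is only an illustration of the general technique, not a description of how any particular site computes its number; `stratified_estimate`, the band sizes and the scores are all made up.

```python
def stratified_estimate(bands):
    """bands: list of (band_size, tested, known) tuples, one per frequency band.
    Each band contributes band_size * (fraction of tested items known)."""
    total = 0.0
    for band_size, tested, known in bands:
        total += band_size * known / tested
    return total

# Hypothetical results: ranks 1-1000, 1001-2000, 2001-5000, 20 items tested in each.
bands = [
    (1000, 20, 19),  # high frequency: 19 of 20 known
    (1000, 20, 14),  # lower frequency: 14 of 20 known
    (3000, 20, 4),   # low frequency: 4 of 20 known
]
print(stratified_estimate(bands))
```

The wide low-frequency band is where most of the uncertainty comes from: each tested item there stands in for 150 characters.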

 

Having written that, I think it's more complicated than I was thinking initially. I think that 4.3 and 4.4 would probably be the most likely ones to be relevant here: https://en.wikipedia.org/wiki/Sampling_%28statistics%29


Yeah, both the zhtools and the nciku estimators use stratified sampling. It's just that there are still real problems when doing this for vocabulary estimation.

But, back on topic, I did both, and they gave me very similar results -- about 4000 characters. I'd say that this is reasonably close to the real value, probably a bit too high.

The most accurate approach is still counting :D


I hit 3700 on zhtools and 3390 on nciku.

I think nciku is more accurate for me, but I also think I told zhtools I recognized a few characters that I actually didn't. So about 3400 seems right for my level.

 

My previous peak was about 3600 characters, about 10-12 years ago. I declined a lot when I stopped reading regularly, but it is all coming back, and I'm filling some gaps I didn't know I had then.  My listening and speaking are much better than they were then, too.

 

But I seem to have switched from a visually-based understanding to a more aurally-based understanding.

 

I should clarify: I didn't want to imply that my results actually indicated nciku was objectively more accurate than zhtools.  The best explanation for the difference in estimates is operator error.


*shameless plug*
I've recently released a tool that will help you figure this out (see discussion here).  It segments and parses Chinese text and allows you to mark words as known/unknown, and then keeps track of words that you know over time.
 
So, you could spend a month (or longer) doing daily reading with a variety of different content with different difficulty levels, and at any point, you can find out exactly how many words you have marked as known.

 

You won't get an instant result like those online tests, but the more you use it, the more accurate that number will be.  It will also reflect your active vocabulary.

 

*shameless plug*


My only obstacle to using it is that I don't do much online reading.  Lately it is all books or television serials.

 

If/when I switch over to more online texts, I certainly will use that tool.


Hmmm, I got 760 on zhtools and 450 on nciku, so I believe the truth may lie somewhere in between :D :D A few months ago I browsed through a book with the 300 most common Chinese characters and I knew all of them, so I think the numbers might be somewhat correct :D Not that I really need this information, but sometimes people get curious like that :D


I got 1995 on nciku and 2200 on the zhtools test, so it seems they are both about the same. Looking at other people's test results, it seems like zhtools is consistently a little bit higher. I also did the zhtools word test, and it says I know 18333 words, which I think is too high.

 

I was just curious how I'd do. 

