
Yet Another Knowledge Estimator


bozhidao


Hi,

I meant to post this a year ago when I put it online. Time flies :)

I put up a small web app that estimates how many characters (and 2-, 3- and 4-character words) you know:

http://bozid.com/quiz

It's hosted on DreamHost (those who read Sinosplice know how that can be), on the cheapest plan, so my apologies if it's not the snappiest.

If you try it, please let me know if you think the estimates are in the right range for you.

I scraped the words from baike.baidu.com about a year ago, so the frequency is a little heavy on Olympic sports terminology in the mid-range... If there's any interest I'll see about scraping again (takes forever).

Thanks!

Bo


It looks very nice.

It is not too well-suited for more advanced people, though. I gave up after 40 characters and 100 bigrams, and I found it mostly too easy.

It estimated around 3500 characters, and I know around 3800, so it is very accurate there. For two-character words, it estimated 9500. This might be right, it's hard to say. I don't have a list of all words I've ever learned, but I guess that my estimate would be in that ballpark.

I think that having pinyin and the answer makes it too easy to guess the answer even if you don't know it. Most characters will have a radical (which eliminates most meanings) and a phonetic (which eliminates most pronunciations).

But I can see how this could be a useful tool for people who don't have a meticulous record of everything they've learned.


For me, it estimated just under 1100 characters which I think is a little low but certainly not way off.

When running the bigrams, it simply stopped on both attempts. (On the second attempt, I'd got to around 75/83 completed, I think.) It seems a little easy to guess correctly, particularly if you know just one character and use a process of elimination.


Thanks for trying it out, renzhe!

It's good to get feedback from someone who is so much farther along in their studies than I am. I see that I should try to shorten the quiz for folks in the high range...

Currently, as you answer questions correctly it just walks up through the character list by frequency in steps of 100. So if you know ~3800 characters it will ask you 38 questions before it can even start probing your knowledge at the upper boundary, which typically takes another 5-15 questions.

I suppose if a user answers 3 or 4 in a row correctly it could start going up in steps of 200. That should cut around 15 questions for someone at your level.
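
A rough sketch of the stepping described above, for anyone curious how much the bigger steps would save. It only models the climb up the frequency list, not the probing at the upper boundary. The `knows` callback, the parameter names and the simulated learner are all made up for illustration; this is not the quiz's actual code.

```python
def climb(knows, step=100, fast_step=200, streak_to_speed_up=None, max_rank=10000):
    """Walk up a frequency-ranked list until the first wrong answer.

    knows(rank) stands in for asking one quiz question at that rank.
    Returns (highest rank answered correctly, number of questions asked).
    """
    rank = 0
    asked = 0
    streak = 0
    while rank < max_rank:
        # Switch to the larger step once the streak of correct answers is long enough.
        use_fast = streak_to_speed_up is not None and streak >= streak_to_speed_up
        next_rank = rank + (fast_step if use_fast else step)
        asked += 1
        if not knows(next_rank):
            break
        rank = next_rank
        streak += 1
    return rank, asked


# Simulated learner who knows exactly the 3800 most frequent characters.
def learner(rank):
    return rank <= 3800


print(climb(learner))                        # fixed steps of 100
print(climb(learner, streak_to_speed_up=3))  # steps of 200 after 3 in a row
```

For this simulated learner the climb takes 39 questions with fixed steps and 21 with the speed-up, though the larger final step may mean a little more probing afterwards.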

I guess I'll just have to learn another few thousand characters so I can fine-tune it at that range ;)

Thanks!

Bo


Sorry for the errors on the bigram test, HedgePig. I think I might be pushing the limits on the cheap-o dreamhost setup. :oops:

Congrats on 1100+ characters, though! That was my goal for last year -- I know how much work it can be. I think it should get easier (or at least more rewarding) from here on out, though, as we can read more interesting things.

For example: As of my last exhaustive scrape of baike.baidu.com (about a year ago) the 90% threshold was 1133 characters, so we should know ~90% of the characters there!

Bo


Naturally :mrgreen: I tried the four-character words, and I got them all right, so it said I knew 4,400 of them (?). Anyhow, I think it is too easy: among the four-character words there are way too many geographical and personal names that can be easily guessed. Maybe you should restrict yourself to chengyu only :D

Also, the entries in the pinyin and the English columns match, which makes it much easier to guess in general. Maybe you could randomise both columns?


chinopinyin: Thanks!

I actually generated my own frequency lists by crawling through baike.baidu.com and counting the occurrences of each character/word there. For definitions I used CC-CEDICT (http://cc-cedict.org/wiki/).
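
A minimal sketch of the counting step, assuming the page text has already been downloaded. The thread doesn't say how words were picked out of the pages, so this simply counts every run of n consecutive Chinese characters (an assumption, not necessarily what the quiz does). The function names and the two sample sentences are made up for illustration.

```python
from collections import Counter


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def count_ngrams(texts, n=1):
    """Count every run of n consecutive Chinese characters across all texts."""
    counts = Counter()
    for text in texts:
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            # Skip grams that contain punctuation, Latin letters, digits, etc.
            if all(is_cjk(c) for c in gram):
                counts[gram] += 1
    return counts


# Stand-in for scraped page text.
pages = ["我们学习中文。", "我们喜欢中文!"]

chars = count_ngrams(pages, n=1)
bigrams = count_ngrams(pages, n=2)
print(chars.most_common(4))
print(bigrams.most_common(2))
```

Sorting the counts with `most_common()` gives the rank order; a running total divided by the grand total gives a cumulative percentage.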

The algorithm I'm using stops the quiz when it thinks it has a good estimate. If it used an arbitrary number of questions it might have to stop before it reached a good estimate for some people. If you want to stop the quiz early there's a button for that. :)

chrix: Wow, I think you're more advanced than the quiz can handle!

Interesting that you say the columns match. They're not supposed to. I'll see if I can reproduce that.

Thanks!

Bo


bozhidao, I think I might not have phrased this right. What I meant was that in the pinyin and English columns there are four options each, and they form four pairs. Thus, if you don't know the meaning of a word, you can exclude options by matching them with the pinyin column. If you randomised this completely, it would take away this possibility.
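
One possible reading of "randomise completely", sketched below: draw the wrong pinyin options and the wrong English options from different words, so that only the correct answer forms a pinyin/English pair. The `build_question` function and the tiny `lexicon` are hypothetical, just to show the idea.

```python
import random


def build_question(target, lexicon, rng):
    """Return (pinyin options, English options) for one quiz word.

    lexicon maps word -> (pinyin, english). The three wrong pinyin options
    and the three wrong English options come from six different words.
    """
    others = [w for w in lexicon if w != target]
    picked = rng.sample(others, 6)  # needs at least 6 other words
    pinyin_col = [lexicon[target][0]] + [lexicon[w][0] for w in picked[:3]]
    english_col = [lexicon[target][1]] + [lexicon[w][1] for w in picked[3:]]
    # Shuffle each column on its own so row position gives nothing away.
    rng.shuffle(pinyin_col)
    rng.shuffle(english_col)
    return pinyin_col, english_col


# Tiny illustrative lexicon.
lexicon = {
    "今天": ("jīntiān", "today"),
    "眼睛": ("yǎnjing", "eye"),
    "儿子": ("érzi", "son"),
    "汽车": ("qìchē", "car"),
    "学习": ("xuéxí", "to study"),
    "朋友": ("péngyou", "friend"),
    "老师": ("lǎoshī", "teacher"),
    "北京": ("Běijīng", "Beijing"),
}

pinyin_col, english_col = build_question("汽车", lexicon, random.Random(0))
print(pinyin_col)
print(english_col)
```

With this layout, knowing the pinyin of a wrong option no longer rules out any of the English options.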


@bozhidao

Why don't you include on your website the list of the most common XX characters in order of frequency? I see it as very interesting in its own right.

Estimates for people with limited vocabulary may be less reliable, though, since there are many high-frequency words in actual speech (eye, son, today, run, car...) that may not be that frequent on baike.baidu.com.

One could argue that the word lists for the new HSK exam may be a good starting point.

The first 150 words one should know are in http://www.confuciusinstitute.qut.edu.au/docs/Chinese_Proficiency_Test_1_Vocab.pdf

The first 300 words in http://www.confuciusinstitute.qut.edu.au/docs/Chinese_Proficiency_Test_2_Vocab.pdf

There have to be 4 additional lists for the other levels of this exam, reaching up to 5000 words, but I have not seen them.

Some other lists

http://lingua.mtsu.edu/chinese-computing/statistics/

http://www.zein.se/patrick/3000char.html


chinopinyin: Yeah, the zein.se list is great, and the mtsu lists were an inspiration.

I opted to make my own lists for a few reasons:

1. I wanted to have a consistent source for words of each length

2. I wanted to make sure my lists represented real, modern usage

3. It's fun :)

If you're interested, here's a one-character list from last February. This isn't the exact list used in the quiz, but it was collected at around the same time.

http://bozid.com/download/frequency.txt

Columns are:

1. rank

2. character

3. total encountered

4. cumulative percentage

There are some obvious anomalies there (e.g. 词), but in general I've been pretty happy with it.
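
A small sketch of reading a file in that four-column layout and looking up a coverage threshold like the 90% figure mentioned earlier in the thread. It assumes the columns are whitespace-separated and that the last column is a percentage from 0 to 100; neither is confirmed here. The sample rows are invented numbers, not real data from frequency.txt.

```python
def parse_frequency_lines(lines):
    """Parse 'rank character total cumulative%' lines into tuples."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        rank, char, total, cumulative = parts
        rows.append((int(rank), char, int(total), float(cumulative.rstrip("%"))))
    return rows


def rank_for_coverage(rows, target_percent):
    """Return the first rank whose cumulative percentage reaches the target."""
    for rank, _char, _total, cumulative in rows:
        if cumulative >= target_percent:
            return rank
    return None  # the list never reaches the target


# Invented sample in the same column order (not real counts).
sample = [
    "1 的 50 50.0",
    "2 一 30 80.0",
    "3 是 15 95.0",
    "4 不 5 100.0",
]

rows = parse_frequency_lines(sample)
print(rank_for_coverage(rows, 90.0))
```

On the real file, the same lookup with 90.0 would give the number of characters needed to cover 90% of the scraped text.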

Bo


stelingo: That's not cheating at all! In fact, if you create an account it will keep track of your scores over time for you so you can chart your progress. If you take the test multiple times in one day it uses the average for that day's score.
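
The per-day averaging could look something like this. It is only a sketch: the `daily_scores` name, the `(date, estimate)` record shape and the example numbers are assumptions for illustration, not how the site stores results.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def daily_scores(results):
    """Collapse (day, estimate) records into one averaged score per day."""
    by_day = defaultdict(list)
    for day, estimate in results:
        by_day[day].append(estimate)
    # Sort by date so the result can be charted in order.
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Made-up results: two attempts on the first day, one on the second.
results = [
    (date(2010, 3, 1), 1000),
    (date(2010, 3, 1), 1200),
    (date(2010, 3, 2), 1150),
]

for day, score in daily_scores(results).items():
    print(day.isoformat(), score)
```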

Thanks for trying it out!

Hofmann: That's a good point. If you don't know an answer you'll get a more accurate estimate if you don't guess. Like stelingo said, you can just click submit without answering.

And yes, it only has simplified characters for now. No offense to any traditional character users, it's just what I've been studying myself. :)

Bo


I think it's definitely too easy.

I was doing the two-character test and got bored after 125.

It said I knew 11196 of those words. Dunno, a lot of the words were city names and I just matched to the English without actually knowing the city.

It's hard to say how accurate it is. I don't keep count of how many characters I know, nor how many words I know. I'd say I know more than 10,000 words for sure, but I doubt they are all two-character words.

Definitely should limit the length of the test. I really got bored in the end.
