Chinese-forums.com

word frequency analysis of HSK level 6 practice exams


大块头


It's commonly said that in order to get a good score on an HSK exam your vocabulary should be about twice as large as the official Hanban vocabulary list. For example, to get a good score on the HSK 6 (5k words on current official vocabulary list) you should really have a vocabulary of about 10k words. But how much truth is in that heuristic?

 

I fed a corpus of 21 official test papers for the HSK level 6 through some Python code to perform a word frequency analysis (segmentation was done with jieba). The plot below summarizes the result of this analysis.

  • If you learned 10k words from the first six randomly selected test papers in this analysis, you would know about 91% of the words on the test.
  • If you followed the (inadvisable) study strategy of learning all your vocabulary from practice tests, then after memorizing the vocabulary from 20 tests and accumulating approximately 20k words you would still understand only about 95% of the words on the test.
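
A minimal sketch of how a coverage figure like the ones above could be computed. It is not the original analysis code: it assumes each test paper is already segmented into a list of words, and that coverage is counted over running words (tokens) rather than unique words, which the post does not specify. The names `coverage` and `coverage_curve` are made up for illustration.

```python
from typing import Iterable, List, Set


def coverage(known: Set[str], test_tokens: Iterable[str]) -> float:
    """Fraction of running words in a test that are in the known vocabulary."""
    tokens = list(test_tokens)
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok in known) / len(tokens)


def coverage_curve(study_papers: List[List[str]], held_out: List[str]) -> List[float]:
    """Coverage of the held-out paper after studying 1, 2, ... papers in order."""
    known: Set[str] = set()
    curve = []
    for paper in study_papers:
        known.update(paper)  # "memorize" every word in this paper
        curve.append(coverage(known, held_out))
    return curve


# Toy data (assumption): each paper is a list of already-segmented words.
papers = [["我", "喜欢", "学习"], ["学习", "中文", "很", "有趣"]]
held_out = ["我", "学习", "中文", "考试"]
print(coverage_curve(papers, held_out))  # [0.5, 0.75]
```

Holding one paper out and studying the others keeps the measurement honest: words are only counted as known if they were seen in a different paper.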

 

The attached csv file contains 20k+ words and their frequency of occurrence in this corpus.
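
For illustration, a word count like the one in the attached file could be produced roughly as follows. This is a sketch, not the code used for the analysis: `count_words` and `write_wordlist` are hypothetical names, the column names are assumed (the actual layout of wordlist.csv may differ), and the segmenter is passed in as a function so that `jieba.lcut` can be swapped in when jieba is installed.

```python
import csv
import io
from collections import Counter
from typing import Callable, Iterable, List, TextIO


def count_words(texts: Iterable[str], segment: Callable[[str], List[str]]) -> Counter:
    """Segment each text and count every non-whitespace token."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(tok for tok in segment(text) if tok.strip())
    return counts


def write_wordlist(counts: Counter, out: TextIO) -> None:
    """Write word,count rows, most frequent first (column names are assumed)."""
    writer = csv.writer(out)
    writer.writerow(["word", "count"])
    writer.writerows(counts.most_common())


# Demo with a stand-in segmenter that splits into single characters.
# With jieba installed, pass jieba.lcut instead:
#     import jieba
#     counts = count_words(texts, jieba.lcut)
texts = ["ab a"]
counts = count_words(texts, list)
buf = io.StringIO()
write_wordlist(counts, buf)
print(buf.getvalue())
```

Note that this sketch counts punctuation tokens too; whether the original analysis filtered them out is not stated.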

 

[Plot: wordcount_plot.png]

 

misc. practice test properties:

  • included transcripts of all the spoken audio
  • average word count per test paper = 10508
  • total word count = 220675

 

ideas for future work:

  • compare word list to HSK vocabulary list
  • compare word frequencies to word frequency lists from published corpora
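
The first idea could start from something like the sketch below. It assumes the corpus counts are available as a word-to-count mapping and the HSK vocabulary as a set of words; `compare_to_hsk` and the toy data are illustrative, not real HSK data.

```python
from typing import Dict, Set


def compare_to_hsk(corpus_counts: Dict[str, int], hsk_words: Set[str]) -> dict:
    """Summarize the overlap between a corpus word list and an HSK word list."""
    corpus_words = set(corpus_counts)
    in_list = corpus_words & hsk_words
    total_tokens = sum(corpus_counts.values())
    return {
        # share of distinct corpus words that are on the HSK list
        "type_share": len(in_list) / len(corpus_words) if corpus_words else 0.0,
        # share of running words that are on the HSK list
        "token_share": (
            sum(corpus_counts[w] for w in in_list) / total_tokens
            if total_tokens
            else 0.0
        ),
        # HSK words that never show up in the corpus
        "unused_hsk": hsk_words - corpus_words,
    }


# Toy data (assumption): not the real corpus counts or the real HSK list.
corpus_counts = {"的": 6, "学习": 3, "衙署": 1}
hsk_words = {"的", "学习", "悠久"}
result = compare_to_hsk(corpus_counts, hsk_words)
print(result)
```

Reporting both shares matters because a handful of very frequent listed words can make the token share look high even when most distinct words are off the list.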

wordlist.csv


To what extent are those 'extra' words necessary to answer the questions? Or are they filler? You don't necessarily need to understand every word. I picked this reading question at random: 

北京四合院之所以有名,首先在于它的历史_____ 。自元代正式建都北京, 大______ 建设都城时起,四合院就与北京的宫殿、衙署、街区、坊巷和胡同 同时_____ 了。

A 悠久 规模 出现 B 深远 面积 建立 C 长久 部分 创造 D 灿烂 格局 成立

 

How much of the sentence could you lose and still get the answer? Nobody's going to get it wrong for not knowing 衙署, for example. So are the questions designed to test your knowledge of those extra X,000 words, or are they designed to be answerable even if you don't (which is a very useful real world skill)? 

 

It's quite relevant for how you should prepare. I'm obviously not going to discourage anyone from learning more words. But I'm not sure you *need* to.

 


35 minutes ago, 大块头 said:

 

ideas for future work:

  • compare word list to HSK vocabulary list
  • compare word frequencies to word frequency lists from published corpora
  • Use magic to delete all non-HSK vocab list words from an exam, and see how many questions can still be answered.

30 minutes ago, roddy said:

To what extent are those 'extra' words necessary to answer the questions? Or are they filler? You don't necessarily need to understand every word.

 

Assuming (maybe incorrectly) that the distribution of the unknown words is similar to what you'd expect from arbitrary Chinese text, it's worth recalling David Moser's famous words of wisdom:

Even though you may know 95% of the characters in a given text, the remaining 5% are often the very characters that are crucial for understanding the main point of the text. A non-native speaker of English reading an article with the headline "JACUZZIS FOUND EFFECTIVE IN TREATING PHLEBITIS" is not going to get very far if they don't know the words "jacuzzi" or "phlebitis".

Pretty cool.

 

I'd like to see a column of how many exams each word occurred in, since it might be more useful for lower-count words, i.e. a word that occurs three times in one text is less useful than a word that occurs one time in three texts.
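
A rough sketch of what that extra column could look like, assuming each exam is available as a list of already-segmented words. `word_stats` is a hypothetical helper, and the toy exams are made up.

```python
from collections import Counter
from typing import Dict, List, Tuple


def word_stats(papers: List[List[str]]) -> Dict[str, Tuple[int, int]]:
    """Map each word to (total occurrences, number of papers it appears in)."""
    total: Counter = Counter()
    doc_freq: Counter = Counter()
    for paper in papers:
        total.update(paper)          # every occurrence counts
        doc_freq.update(set(paper))  # each paper counts at most once per word
    return {word: (total[word], doc_freq[word]) for word in total}


# Toy data (assumption): three tiny "exams".
papers = [
    ["四合院", "四合院", "四合院", "北京"],
    ["北京", "胡同"],
    ["北京"],
]
stats = word_stats(papers)
for word, (count, n_papers) in stats.items():
    print(word, count, n_papers)
```

Here 四合院 and 北京 both occur three times, but only 北京 appears in all three papers, which is exactly the distinction a plain frequency column hides.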

 

Edit: Not sure if you are including transcripts for the audio, but if you aren't, I've put some transcripts up in the transcription project @大块头


8 hours ago, roddy said:

Use magic to delete all non-HSK vocab list words from an exam, and see how many questions can still be answered.

Not sure if this belongs in its own thread, but here are two separate LISTENING questions (transcribed) from an HSK 6 exam. 

 

- The different levels of HSK words are highlighted in different colours.

- Un-highlighted words are non-HSK.

- The words have been split according to a larger dictionary so some of the multi character words might be made up of HSK words, but are together (and so, hidden) because there is a separate dictionary entry. 

 

For some reason 所以 isn't highlighted even though I'm pretty sure it's an HSK word, but I think the rest is mostly correct. 

 

Another problem is that characters like 事 and 中 aren't separate words in the HSK list, so they are hidden too, making it even more obscure in some places where it would really be easy to understand.
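
For anyone who wants to try the same thing on other texts, the hiding step can be sketched as below. This is not the tool used for the screenshots: it assumes the text is already segmented, that the HSK list is a plain set of words, and that punctuation should stay visible. `mask_non_hsk` and the toy word set are illustrative only.

```python
from typing import Iterable, Set


def mask_non_hsk(tokens: Iterable[str], hsk_words: Set[str], mask_char: str = "_") -> str:
    """Replace every token that is not in hsk_words with underscores.

    Tokens without any letters or digits (punctuation) are left as they are.
    """
    out = []
    for tok in tokens:
        is_punctuation = not any(ch.isalnum() for ch in tok)
        if tok in hsk_words or is_punctuation:
            out.append(tok)
        else:
            out.append(mask_char * len(tok))  # keep the original length visible
    return "".join(out)


# Toy data (assumption): a hand-segmented sentence and a made-up "HSK" set.
tokens = ["四合院", "就", "与", "北京", "的", "衙署", "同时", "出现", "了", "。"]
hsk_words = {"就", "与", "北京", "的", "同时", "出现", "了"}
masked = mask_non_hsk(tokens, hsk_words)
print(masked)  # ___就与北京的__同时出现了。
```

Adding the single characters that make up listed words to `hsk_words` would be one way to stop characters like 事 and 中 from being hidden.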

 

Non HSK words hidden:

[Image: listening questions with non-HSK words hidden]

 

All words shown:

[Image: listening questions with all words shown]

 


9 hours ago, Demonic_Duck said:

Assuming (maybe incorrectly) that the distribution of the unknown words is similar to what you'd expect from arbitrary Chinese text, it's worth recalling David Moser's famous words of wisdom:

It isn't (just) about the words, it's about the question. Take that sentence as an example...

 

JACUZZIS FOUND EFFECTIVE IN TREATING PHLEBITIS

 

I can write three questions for that off the top of my head...

 

1) What section of a newspaper are you most likely to find this headline in? A) Medical B) Politics C) Cooking 

2) People interested in this article might have problems with their A) legs B) hearing C) joints

3) This article might encourage some people to A) eat a certain food B) buy a new type of bath C) wear different clothes

 

Depending on what I want to test. You could probably get 1) right as long as you know 'treat' in the verbal 'treat a disease' sense and don't get confused with the noun 'tasty snack' sense. 2) you'd need to know what part of the body phlebitis usually affects, and that'd be an unfair question for anyone but a med student. 3) You need to know what a jacuzzi is - it's not a pizza or compression leggings. 

 

Don't get me wrong, knowing the extra words will make for a much more pleasant exam experience. But unless we see that you need to know the extra words to answer the questions, you might be better off a) getting used to the idea you won't know all the words and b) spending the extra time on knowing what they do test very well.

 

To be clear, I don't know either way. But there being 10k words in an HSK6 exam does not necessarily mean you need to know 10k words to pass it. 


13 hours ago, Demonic_Duck said:

A non-native speaker of English reading an article with the headline "JACUZZIS FOUND EFFECTIVE IN TREATING PHLEBITIS" is not going to get very far if they don't know the words "jacuzzi" or "phlebitis".

I am in fact a non-native speaker of English who does know what a jacuzzi is but not what phlebitis is (haven't looked it up, will do so later). I do know/guess/infer that it's a type of disease. I think if I were interested in an article with this headline, I'd have a decent idea of what phlebitis is after reading it.

 

4 hours ago, roddy said:

1) What section of a newspaper are you most likely to find this headline in? A) Medical B) Politics C) Cooking 

2) People interested in this article might have problems with their A) legs B) hearing C) joints

3) This article might encourage some people to A) eat a certain food B) buy a new type of bath C) wear different clothes

1 is easy. 3 is also easy since I know what a jacuzzi is. 2 is more difficult, but since I know what a jacuzzi is, I know it's not B and probably C. (Also this is not a good question because we have joints in our legs.)

 

4 hours ago, roddy said:

You could probably get 1) right as long as you know 'treat' in the verbal 'treat a disease' sense and don't get confused with the noun 'tasty snack' sense.

I'm now amused by the thought of feeding a jacuzzi (could be some kind of canape) to a phlebiti (sounds like a cute animal, a bit like a quogga) as a snack to reward it for good behaviour.


On 8/7/2020 at 6:41 AM, roddy said:

JACUZZIS FOUND EFFECTIVE IN TREATING PHLEBITIS

 

If we assume in this hypothetical English exam that the words "FOUND" "EFFECTIVE" "IN" and "TREATING" are on the official vocabulary list for that level, but "JACUZZIS" and "PHLEBITIS" are not, then the whole exam would be poorly designed.

 

On 8/6/2020 at 8:52 PM, 大块头 said:

If you learned 10k words from the first six randomly selected test papers in this analysis you would know about 91% of the words on the test.

 

Again, this either sounds like a poorly designed test or they are dishonest about the HSK level vocabulary requirements in order to not discourage people.

 

 


I think a key takeaway from this data is that Hanban isn't dumbing down the material it presents in the current HSK 6 test. If they were trying to limit the scope of vocabulary on the test to some top-secret word list, we'd see the unique word count level off at a faster rate.
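
The "leveling off" can be checked with a vocabulary growth curve: the number of distinct words seen after each additional paper. A minimal sketch, assuming segmented papers; `vocabulary_growth` is a hypothetical name and the data is a toy example, not the real corpus.

```python
from typing import List


def vocabulary_growth(papers: List[List[str]]) -> List[int]:
    """Number of distinct words seen after reading 1, 2, ... papers in order."""
    seen = set()
    sizes = []
    for paper in papers:
        seen.update(paper)
        sizes.append(len(seen))
    return sizes


# Toy data (assumption): letters stand in for words.
papers = [["a", "b", "c"], ["b", "c", "d"], ["c", "d"]]
growth = vocabulary_growth(papers)
print(growth)  # [3, 4, 4]
```

If the test drew on a closed word list, the increments between consecutive values would shrink toward zero quickly; a curve that keeps climbing suggests an open vocabulary.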

 

Once I can get practice tests for the HSK 7-9 I'll compare them to this corpus to see if it looks like the source material for that test is at a higher reading level. It may be that they'll use the same sources but just ask harder questions.

 

On 8/6/2020 at 3:15 PM, roddy said:

To what extent are those 'extra' words necessary to answer the questions? Or are they filler? You don't necessarily need to understand every word.

 

On 8/6/2020 at 3:15 PM, roddy said:

are the questions designed to test your knowledge of those extra X,000 words, or are they designed to be answerable even if you don't (which is a very useful real world skill)? 

 

It's quite relevant for how you should prepare. I'm obviously not going to discourage anyone from learning more words. But I'm not sure you *need* to.

 

Expanding one's vocabulary can definitely be an exercise in diminishing returns. Here's a plot I made of word frequency data from two different corpora: the SUBTLEX-CH corpus (derived from subtitles) and the Lancaster Corpus of Modern Chinese (derived from a balance of written and spoken sources).

 

[Plot: word_freq_comp.png]

 

If the horizontal axis isn't plotted in log-scale the curve looks like a cliff. After a certain point you're better off learning vocabulary related to your profession or using the vocabulary you know more effectively.
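
The diminishing returns can also be put into numbers. The sketch below assumes a frequency list is available as a word-to-count mapping and computes the share of running text covered by the top N words; `cumulative_coverage`, `words_needed`, and the counts are illustrative, not taken from SUBTLEX-CH or the Lancaster corpus.

```python
from itertools import accumulate
from typing import Dict, List, Optional


def cumulative_coverage(counts: Dict[str, int]) -> List[float]:
    """Entry i is the share of all tokens covered by the i+1 most frequent words."""
    freqs = sorted(counts.values(), reverse=True)
    total = sum(freqs)
    return [running / total for running in accumulate(freqs)]


def words_needed(curve: List[float], target: float) -> Optional[int]:
    """Smallest vocabulary size (by frequency rank) reaching the target coverage."""
    for rank, cov in enumerate(curve, start=1):
        if cov >= target:
            return rank
    return None


# Toy data (assumption): made-up counts with a steep frequency drop-off.
counts = {"的": 50, "是": 30, "学习": 15, "衙署": 5}
curve = cumulative_coverage(counts)
print(curve)                     # [0.5, 0.8, 0.95, 1.0]
print(words_needed(curve, 0.9))  # 3
```

Plotting `curve` against rank with a logarithmic horizontal axis (for example with matplotlib's `plt.semilogx`) gives the kind of curve described here.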

 

22 hours ago, markhavemann said:

I'd like to see a column of how many exams each word occurred in, since it might be more useful for lower-count words, i.e. a word that occurs three times in one text is less useful than a word that occurs one time in three texts.

 

That's a good idea. I'll split the csv up the next time I find time to work on this. Like any piece of text these practice tests exhibit "burstiness".

 

To pedantically quote my machine learning textbook:

Quote

Although the multinomial classifier is easy to train and easy to use at test time, it does not work particularly well for document classification. One reason for this is that it does not take into account the burstiness of word usage. This refers to the phenomenon that most words never appear in any given document, but if they do appear once, they are likely to appear more than once, i.e., words occur in bursts.
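
One simple way to look for burstiness in a set of papers is to ask, for a given word, how often it repeats in the papers where it appears at all. This is only a rough sketch of that idea, not a measure taken from the textbook; `repeat_rate` is a made-up name and the papers are toy data.

```python
from typing import List, Optional


def repeat_rate(papers: List[List[str]], word: str) -> Optional[float]:
    """Among papers containing the word, the share where it occurs at least twice.

    Returns None if the word appears in no paper.
    """
    counts = [paper.count(word) for paper in papers]
    containing = [c for c in counts if c >= 1]
    if not containing:
        return None
    return sum(1 for c in containing if c >= 2) / len(containing)


# Toy data (assumption): three tiny segmented "papers".
papers = [
    ["四合院", "四合院", "北京"],
    ["北京"],
    ["胡同", "四合院", "四合院", "四合院"],
]
print(repeat_rate(papers, "四合院"))  # 1.0
print(repeat_rate(papers, "北京"))    # 0.0
```

A topic word such as 四合院 tends to repeat within the one passage about it, so it scores high here even though it appears in few papers.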

 


2 hours ago, 大块头 said:

After a certain point you're better off learning vocabulary related to your profession or using the vocabulary you know more effectively

Yup. I’ve been trying to tell people this for years. 
 

Learning vocabulary from general word lists, e.g. the HSK or other lists, is inefficient.
 

This effect will be noticeable by at least HSK 4, and probably earlier.

