Chinese-Forums

HSK 3.0 ... new, new HSK?


realmayo


2 hours ago, 叫我小山 said:

made an Anki deck combining all of the 普通话水平测试 vocabulary; the HSK3.0 vocabulary

Someone found my flashcard decks. Cool.

 

2 hours ago, 叫我小山 said:

then removed all of the individual characters

I wouldn't remove all of the individual characters: even within the first few dozen entries on the PSC(2) list, 癌 & 庵 are characters that don't feature as parts of whole words.

 

2 hours ago, 叫我小山 said:

I figure if you learn these words you probably won’t run into any words that you: a) won’t understand from context and from the characters encompassing it, or b) would find extremely rare and probably don’t need to know (as you might hear it less than once a year or something).

You'd probably want to include the first 2000 idioms, from this list,  as well then.

Btw, what dictionary did you use for the Anki entries? CC-CEDICT? I've been looking at doing something similar, but wanted to have both of my dictionaries, "Oxford C-E" and "现代汉语规范词典", as entries. But there probably isn't an automatic process you can use for that, is there? Also, for the idioms one would probably want to include the "多功能成语词典", which is just a treat.
 

Link to comment
Share on other sites

5 hours ago, 叫我小山 said:

I made an Anki deck combining all of the 普通话水平测试 vocabulary; the HSK3.0 vocabulary; as well as the current HSK1-5 (as I’ve learnt up to that point now) then removed all of the individual characters (as I have a deck for that already), and it came to approximately 15,500 words. I figure if you learn these words you probably won’t run into any words that you: a) won’t understand from context and from the characters encompassing it, or b) would find extremely rare and probably don’t need to know (as you might hear it less than once a year or something)

 

Have you tested that assumption with actual books you checked in CTA?

 

I have made a CTA word list consisting of the more comprehensive old HSK (pre 2010), HSK 3.0, and 普通话水平测试.

 

Then I tested a NY Times bestseller:

 

    Total    191.024
    Known    163.423
    Percent Known    85,55%
    Unknown    27.601
    Percent Unknown    14,45%
    Unique    14.533
    Known    8.112
    Percent Known    55,82%
    Unknown    6.421
    Percent Unknown    44,18%
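
For anyone who wants to sanity-check figures like these outside CTA: the first block counts every occurrence of a word (token coverage) and the second counts each distinct word once (type coverage). Below is a minimal Python sketch of that calculation. It assumes the text has already been segmented into a list of words, and the function name and sample data are made up for illustration; it is not how CTA itself computes anything.

```python
from collections import Counter

def coverage(tokens, known):
    """Return token-level and type-level coverage of `tokens` by the set `known`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    total_known = sum(n for word, n in counts.items() if word in known)
    unique = len(counts)
    unique_known = sum(1 for word in counts if word in known)
    return {
        "total": total,
        "total_known": total_known,
        "total_pct": 100 * total_known / total,
        "unique": unique,
        "unique_known": unique_known,
        "unique_pct": 100 * unique_known / unique,
    }

# Hypothetical, already-segmented sample text and known-word list.
tokens = ["我", "是", "学生", "我", "喜欢", "阳光"]
known = {"我", "是", "学生"}
stats = coverage(tokens, known)
print(stats)
```

Token coverage is normally much higher than type coverage, because the most frequent words are repeated many times while each rare word is counted only once in the unique figures.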
 

I think it is important to know that the 普通话水平测试 does not equal every word a Chinese College student knows. Rather it is an extra set of words they need to know on top of the "who knows how many words that are assumed to be common knowledge".

If the College student only knew the words of 普通话水平测试 that person would struggle with said NY Times bestseller:

 

    Total    191.024
    Known    84.019
    Percent Known    43,98%
    Unknown    107.005
    Percent Unknown    56,02%
    Unique    14.533
    Known    3.784
    Percent Known    26,04%
    Unknown    10.749
    Percent Unknown    73,96%
 

Link to comment
Share on other sites

@Jan Finster A lot of those "unknown words" can be guessed by the reader due to knowing the constituent characters though... 73.96% unknown? Seems rather extreme. How many of those are proper nouns?

 

2 hours ago, Jan Finster said:

think it is important to know that the 普通话水平测试 does not equal every word a Chinese College student knows. Rather it is an extra set of words they need to know on top of the "who knows how many words that are assumed to be common knowledge".


Actually, the 普通话水平测试 is a test of words that are assumed to be common knowledge. It's made by the same commissions that created the HSK. It's literally just a glorified frequency list. Except for lacking most common idioms and overemphasizing neutral tones and erhua, the PSC is literally just the HSK on steroids. There is a REALLY large overlap.

Link to comment
Share on other sites

1 hour ago, Weyland said:

A lot of those "unknown words" can be guessed by the reader due to knowing the constituent characters though... 73.96% unknown? Seems rather extreme. How many of those are proper nouns?

 

Why do only nouns count? Personally, I think verbs and, to some extent, adjectives are just as important.

 

The book I am talking about is a non-fiction book, but again a NY Times Bestseller, so while there are some "technical terms" it is written for an educated general audience.

 

1 hour ago, Weyland said:

Actually, the 普通话水平测试 is a test of words that are assumed to be common knowledge

 

For fun, I ran the highly imaginative sentence 我是学生 against 普通话水平测试 and 学生 is not even on the list. I tried another example: 外面很阳光 and 阳光 is not on the list. So, this cannot possibly be a comprehensive reflection of common knowledge.

 

1 hour ago, Weyland said:

It's made by the same commissions that created the HSK. It's literally just a glorified frequency list. Except for lacking most common idioms, and overemphasizing neutral tones and erhua the PSC is literally just the HSK on steroids. There is a REALLY large overlap.

 

When I run the old HSK (pre 2010), which is supposed to be the more challenging HSK version, against 普通话水平测试 (with the latter being the reference word list), I get:

 

    Total    8.583
    Known    3.551
    Percent Known    41,37%
    Unknown    5.032
    Percent Unknown    58,63%
    Unique    8.583
    Known    3.551
    Percent Known    41,37%
    Unknown    5.032
    Percent Unknown    58,63%
 

How is this a large overlap? I was shocked!

Link to comment
Share on other sites

18 minutes ago, Jan Finster said:

For fun, I ran the highly imaginative sentence 我是学生 against 普通话水平测试 and 学生 is not even on the list. I tried another example: 外面很阳光 and 阳光 is not on the list. So, this cannot possibly be a comprehensive reflection of common knowledge.


I don't know what list you're using, but here is 学生 at #5427. And here is 阳光 at #5526.

EDIT: Also, when used as an adjective, "阳光" can only describe something "open to the public" (e.g. an investigation) or someone "upbeat / cheery". Otherwise, it means "sunshine" and is a noun. You're probably thinking of 晴朗, which is part of HSK6 and the PSC.

...

 

18 minutes ago, Jan Finster said:

When I run the old HSK (pre 2010), which is supposed to be the more challenging HSK version, against 普通话水平测试 (with the latter being the reference word list)


What lists are you even using? We have a list for the new HSK3.0 with 11,092 words. And the PSC is 17,055 words.

 

18 minutes ago, Jan Finster said:

Why do only nouns count? Personally, I think verbs and, to some extent, adjectives are just as important.


It isn't about it "being important", but rather about it being "various". You can define one proper noun, like a name or an invention, and then use it from then on without it having to be "known" by anyone past or present. E.g. the town name "Llanfairpwll-gwyngyllgogerychwyrndrob" in Wales is a proper noun. Heck, every person's name on this forum is a proper noun.

So in the future, before you "see what the overlap is" you might want to first prune that list of words with, I don't know, a dictionary. If it isn't in a popular, everyday, dictionary then it probably is either a proper noun or the program made a mistake when segmenting the words.
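
The pruning step could look something like this in Python. It's only a sketch: the dictionary here is a tiny stand-in set, the sample "unknown" words are invented, and in practice you'd load the headwords from whatever everyday dictionary you trust.

```python
def prune_unknowns(unknown_words, dictionary_headwords):
    """Split unknown words into dictionary words and likely names/segmentation errors."""
    in_dict = [w for w in unknown_words if w in dictionary_headwords]
    suspect = [w for w in unknown_words if w not in dictionary_headwords]
    return in_dict, suspect

# Hypothetical data: a stand-in for a real dictionary's headword list.
dictionary_headwords = {"晴朗", "阳光", "调查"}
unknown_words = ["晴朗", "韦兰", "光很"]  # a real word, a name, a mis-segmented pair

real_words, suspect = prune_unknowns(unknown_words, dictionary_headwords)
print(real_words)  # ['晴朗']
print(suspect)     # ['韦兰', '光很']
```

Only the first list would then count towards the "unknown" figure; the second is worth eyeballing by hand.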

Edited by Weyland
阳光 grammar lesson
Link to comment
Share on other sites

1 hour ago, Weyland said:

I don't know what list you're using, but here is 学生 at #5427. And here is 阳光 at #5526.

 

I used this list: https://www.chinese-forums.com/forums/topic/37109-psc-普通話水平測試-vocabulary-list/

 

Do you have the whole list as txt or xls so I can compare where the mistakes are?

 

1 hour ago, Weyland said:
1 hour ago, Jan Finster said:

When I run the old HSK (pre 2010), which is supposed to be the more challenging HSK version, against 普通话水平测试 (with the latter being the reference word list)


What lists are you even using? We have a list for the new HSK3.0 with 11,092 words. And the PSC is 17,055 words.

 

I was not talking about HSK 3.0., but the old HSK 1.0. I found that list here:

https://www.chinese-forums.com/forums/topic/53566-old-hsk-vocab-lists/

 

So, basically I ran https://www.chinese-forums.com/forums/topic/37109-psc-普通話水平測試-vocabulary-list/ against https://www.chinese-forums.com/forums/topic/53566-old-hsk-vocab-lists/

 

Let me know if there are mistakes in those lists. 

 

 

Link to comment
Share on other sites

I should first state that this deck is for personal use, so its purpose is a bit custom to me, but it solely contains the PSC, HSK3.0 and HSK1-5. (Someone could add the 2010 HSK level 6 as well if they wanted to be thorough, but I assume that any vocabulary in it would already be in the PSC and HSK3.0).

 

Weyland, I removed all the individual characters because the 3,000 most common ones will be the base knowledge and any single characters after that that I encounter “in the wild” I will add manually.

 

成语 are another matter entirely, as it seems natives know multiple thousands. I have a 成语 book with over 10,000 and my wife breezed through it and knew almost every one. They just have much more exposure to them than a lowly foreign student. I will be adding the most frequent 2,000 as you suggested, because there aren’t that many in the deck I have made.

 

I used Anki and then used the Chinese Support add-on to add definitions. I believe it uses CC-CEDICT, but it’s formatted in such a way that it’s only really usable in CN -> EN, or “passive recognition (i.e. reading)”. If you were to have the definition on the front it would be very difficult as there are sometimes 12-15 English definitions. It doesn’t pick the 3 most common ones, for example.

 

Jan Finster, is it possible that the NYT bestseller you are referring to is a translation EN -> CN? If so, it isn’t an accurate reflection of 普通话 frequency, as translations are usually simpler in terms of lexical variety. I read 小王子 and it seemed easier than it should have been; it was likely “simplified” when translating certain nouns or verbs into CN.

  • Like 1
Link to comment
Share on other sites

2 hours ago, Jan Finster said:


[Two screenshots attached]

Both 学生 & 阳光 are part of the lists you've linked.

 

 

2 hours ago, Jan Finster said:

Do you have the whole list as txt or xls so I can compare where the mistakes are?


Already linked them.

[Image attached]
 

Link to comment
Share on other sites

19 minutes ago, Weyland said:

Both 学生 & 阳光 are part of the lists you've linked.

 

Thanks! 

 

Then it is CTA's fault, or the lists were somehow compromised before I included them in CTA. I basically copy/pasted all words of 普通话水平测试 into CTA and marked all as known. I saved this as a reference word list. Then I copy/pasted 我是学生 and 外面很阳光 into CTA and got the results I mentioned above.

 

3 hours ago, Weyland said:

It isn't about it "being important", but rather about it being "various". You can define one proper noun, like a name or an invention, and then use it from then on without it having to be "known" by anyone past or present. E.g. the town name "Llanfairpwll-gwyngyllgogerychwyrndrob" in Wales is a proper noun. Heck, every person's name on this forum is a proper noun.

So in the future, before you "see what the overlap is" you might want to first prune that list of words with, I don't know, a dictionary. If it isn't in a popular, everyday, dictionary then it probably is either a proper noun or the program made a mistake when segmenting the words.

 

I think this is a valid remark, but it should not apply when you compare the word lists we are talking about. They should not contain a significant number of proper nouns. 

 

So, if I cannot trust CTA's comparison, do you have a method to determine how much overlap there is between:

HSK 1.0. (old HSK) vs 普通话水平测试  

HSK 3.0. vs 普通话水平测试  

HSK 2.0. vs 普通话水平测试  

 

Thanks!

 

Link to comment
Share on other sites

4 hours ago, Jan Finster said:

HSK 1.0. (old HSK) vs 普通话水平测试  

HSK 3.0. vs 普通话水平测试  

HSK 2.0. vs 普通话水平测试  

 

Based on Weyland's word lists, here is the comparison back to back according to CTA. The "reference word list" (column) is set as 100% known, and then HSK 1.0 (pre 2010), HSK 2.0 (2010-2020), HSK 3.0 (2021+) and PSC (Weyland's 普通话水平测试) are copy/pasted into CTA. The numbers stand for words not covered in the reference list:

 

                 Reference word list
Tested list   HSK 1.0   HSK 2.0   HSK 3.0     PSC
HSK 1.0             -      4153      1673     896
HSK 2.0           567         -       490     435
HSK 3.0          3905      6308         -    2359
PSC             12564     14078      7596       -
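
Since the word lists are already one word per line, the same kind of grid can also be worked out without any segmenter, by treating each list as a set and taking differences. A rough Python sketch follows. The two tiny lists are placeholders, not the real HSK or PSC data, and `load_list` is a hypothetical helper that assumes a UTF-8 text file with one word per line.

```python
from pathlib import Path

def load_list(path):
    """Read a one-word-per-line UTF-8 file into a set, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8-sig")
    return {line.strip() for line in text.splitlines() if line.strip()}

def not_covered(tested, reference):
    """Count words in `tested` that are missing from `reference`."""
    return len(tested - reference)

# Placeholder data standing in for the real lists.
lists = {
    "A": {"学生", "阳光", "晴朗"},
    "B": {"学生", "阳光"},
}

# One cell per (tested list, reference list) pair, skipping the diagonal.
grid = {
    (row, col): not_covered(lists[row], lists[col])
    for row in lists for col in lists if row != col
}
print(grid)  # {('A', 'B'): 1, ('B', 'A'): 0}
```

Numbers produced this way may differ from CTA's, because no segmentation is involved: a word either appears on the reference list exactly as written or it doesn't.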

 

Link to comment
Share on other sites

I'd be cautious about how you interpret CTA's numbers: it's a super-useful tool but (unless it's improved recently) it doesn't do a perfect job of segmenting. That means it'll often tie two characters together as one (often very rare) word, rather than realising that the first character is, say, the last part of someone's name, and the second character belongs to another word or should be standing alone.

The result is it overestimates - I think - the number of rare vocabulary items ... and because they're so rare, you're unlikely to have studied them, making you more pessimistic than perhaps you should be about how much vocab you'll already know in a novel.

Best to use it as a relative figure: 'it says XX%, that puts it halfway between that really easy book I read last week and that tricky one from last month' for instance.
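
To make the failure mode concrete, here is a toy dictionary-based segmenter in Python using greedy forward maximum matching. This is an assumption for illustration only: nothing in this thread says which algorithm CTA actually uses, and the four-word dictionary is made up.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmenter: at each position take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Toy dictionary (hypothetical); the intended reading is 研究 / 生命 / 起源.
dictionary = {"研究", "研究生", "生命", "起源"}
result = forward_max_match("研究生命起源", dictionary)
print(result)  # ['研究生', '命', '起源']
```

The greedy choice glues 生 onto 研究 and leaves 命 stranded on its own, which is the same kind of mistake as tying the last character of a name to the start of the next word.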

  • Like 2
Link to comment
Share on other sites

On 11/30/2020 at 8:30 AM, Jan Finster said:

I basically copy/pasted all words of 普通话水平测试  into CTA and marked all as known. I saved this as a reference word list.

This might be part of the problem - you're relying on CTA's segmenter when the words are already segmented by line. It would be better to import this list of words, rather than copy and paste and mark as known.

 

18 hours ago, realmayo said:

it doesn't do a perfect job of segmenting.

This is probably another part of the problem. If you're looking for more accurate segmentation (and don't mind if it takes a couple of minutes to process a book, compared with less than a second in CTA), I recommend using the Stanford Segmenter to segment the text, then feeding the segmented text into CTA.

  • Like 1
Link to comment
Share on other sites

1 hour ago, Yadang said:

This might be part of the problem - you're relying on CTA's segmenter when the words are already segmented by line. It would be better to import this list of words, rather than copy and paste and mark as known.

 

19 hours ago, realmayo said:

it doesn't do a perfect job of segmenting.

This is probably another part of the problem. If you're looking for more accurate segmentation (and don't mind if it takes a couple of minutes to process a book, compared with less than a second in CTA), I recommend using the Stanford Segmenter to segment the text, then feeding the segmented text into CTA.

@imron I would be curious as to what you, Imron, have to say about this. Also, based on how you created CTA, do you think the data I showed above is accurate or flawed?

On 11/30/2020 at 10:25 PM, Jan Finster said:

The numbers stand for words not covered in the reference list:

 

                 Reference word list
Tested list   HSK 1.0   HSK 2.0   HSK 3.0     PSC
HSK 1.0             -      4153      1673     896
HSK 2.0           567         -       490     435
HSK 3.0          3905      6308         -    2359
PSC             12564     14078      7596       -

 

Link to comment
Share on other sites

CTA trades accuracy for speed, and produces figures that are ballpark correct. 
 

This is useful enough for extracting frequently occurring unknown words and for comparing texts to get an idea of relative difficulty - the two main design goals of CTA. 

 

I would like to improve the segmenter, but doing so would involve a non-trivial amount of effort for little extra improvement in the two areas. 
 

I generally agree with what realmayo and yadang said. 

  • Helpful 1
Link to comment
Share on other sites

No.

 

The Stanford segmenter is generally considered best-of-breed but:

 

1. It's licensed under the GPL, which means I couldn't incorporate it into CTA without also releasing CTA (and its source) under the GPL

2. Even if I did that, it's written in Java whereas CTA is written in C++, and the languages are not compatible enough to make this a useful endeavour

3. Even if I did decide to glue them together (it's not impossible), the Stanford segmenter is significantly slower than CTA and uses significantly more memory than CTA (both because of Java and because of the better segmentation algorithm), which would cause some problems because several parts of CTA require a fast segmenter to work properly (realtime highlighting of text).

4. I could use the same algorithm (a CRF word segmenter) and write my own implementation in C++ which would be faster and unencumbered by the GPL *however* that then falls back to what I said above about non-trivial effort for little improvements.

5. If I had the time to do such a thing (which I currently don't) that time would be better spent on other parts of CTA - e.g. the corpus feature, graphs of learnt words over time, etc which would provide greater value to users than a minor improvement to segmentation.

  • Helpful 1
Link to comment
Share on other sites

  • 5 weeks later...

[Image: stacked histogram of word frequencies]

 

Credit goes to @alanmd at HSK东西 for the idea of plotting word frequencies in a stacked histogram.

Word frequency data is largely based on the gigantic 15-billion-character BLCU corpus, along with some supplementation from the Lancaster Corpus of Modern Chinese and the SUBTLEX-CH word frequency listings.

  • Like 3
Link to comment
Share on other sites

  • 2 months later...

That was a bit of an unusual situation, as those were rival exams being offered by different bodies (BLCU and Hanban). I'm not sure if you'll see HSK 2.0 and 3.0 offered at the same time in the same place, although there may well be chances to take HSK 3.0 while they're trialing it, or to pick and choose by switching between local exam centres during the roll-out. 

 

 

  • Helpful 1
Link to comment
Share on other sites
