
Guide to reading Chinese fiction, from absolute beginner to beyond HSK 6


MoonIvy


On 1/25/2022 at 12:25 PM, Jan Finster said:

I would however not rank the levels according to known character count, but word count.

(If someone studied 3000 characters by Heisig method and/or Anki and was then asked to read a novel or news site, it would not be pretty at all.)

Ranking by character count isn't the best way, but we can't think of another, as not everyone follows HSK, and labels like "beginner" and "intermediate" are meaningless as well. If someone does decide to study only 3000 characters and nothing else, then thinks they can read a novel, well ...... it's probably a good wake-up call when they realise that's impossible. However, having said all that, do you have an idea how we can rank the content?


On 1/25/2022 at 12:30 PM, Moshen said:

Also, is there a way to download the document?  I don't normally use Google docs and did not see a download link.

You might be able to save it as a PDF by going to File > Print > Save as PDF. We often update the resources (we find better ones, or some get taken down), so I would recommend coming back to check on it every now and then.


Looks really good! I'll have to take a really good look!

 

 

On 1/25/2022 at 2:30 PM, MoonIvy said:

However, having said all that, do you have an idea how we can rank the content?

 

How about a word frequency?
You could take a frequency list like this: https://www.plecoforums.com/threads/word-frequency-list-based-on-a-15-billion-character-corpus-bcc-blcu-chinese-corpus.5859/

 

And then analyze each text and calculate the weighted average frequency of all the words in the text. The end result would be a score that tells you the relative "easiness" of the text based on the word frequencies. The more frequently used words there are in the text, the greater the score and, in theory, the easier the text.
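A minimal sketch of that score, assuming the text has already been segmented into words and that the frequency list has been loaded into a plain dict of relative frequencies. The function name `easiness_score` and the toy numbers are made up for illustration; they are not taken from the BCC list.

```python
from collections import Counter


def easiness_score(tokens, corpus_freq):
    """Weighted average of corpus frequencies over the words of one text.

    Each distinct word contributes its corpus frequency, weighted by how
    often it occurs in the text. Words missing from the list count as 0.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    weighted = sum(corpus_freq.get(word, 0.0) * n for word, n in counts.items())
    return weighted / total


# Made-up relative frequencies, for illustration only.
toy_freq = {"我": 0.02, "是": 0.015, "学生": 0.001, "饕餮": 0.000001}
easy_text = ["我", "是", "学生"]
hard_text = ["我", "是", "饕餮"]

print(easiness_score(easy_text, toy_freq) > easiness_score(hard_text, toy_freq))
```

A text made of common words gets the higher score. Note that a few very common words (的, 了, 是) can dominate an average like this, so the scores are probably only useful for comparing texts against each other.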


@Jan Finster ah, I misread your message; you suggested ranking it by word count. We thought about this when we first started the doc and came across the following problem.

When analysing Chinese text, every variation counts as a word. For example, given 第一次, 第二次, 第三次, 一棵树, 两棵树, a computer would count all of these as different words; however, as a learner, you don't need to know them all individually. Another example is 走, 走着, 走着走着: a computer would count all of these as different words.
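To make the problem concrete, here is a rough sketch that counts distinct words before and after collapsing runs of Chinese numerals into a placeholder. The `collapse_numerals` helper is hypothetical and deliberately crude: it would also mangle ordinary words such as 一样, and it does nothing for cases like 走 / 走着 / 走着走着.

```python
import re

# Runs of common Chinese numerals (an assumed, incomplete set).
NUMERAL_RUN = re.compile(r"[零一二三四五六七八九十百千两]+")


def collapse_numerals(token):
    """Replace each run of numerals in a token with '#'."""
    return NUMERAL_RUN.sub("#", token)


tokens = ["第一次", "第二次", "第三次", "一棵树", "两棵树"]
raw_types = set(tokens)
merged_types = {collapse_numerals(t) for t in tokens}

print(len(raw_types), len(merged_types))
```

Five distinct "words" for the computer become two patterns (第#次 and #棵树) for the learner.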

We had previously analysed many native webnovels, and we found the word count to be extremely high (150k-300k); even children's books for 6-7 year olds were around 5-8k. If we were to change the ranks, it'd be something along the lines of 5,000, 50,000, 100,000, 200,000+. They're really high numbers, and I feel that'll just scare people.

We had a word count column in our reading resource: https://docs.google.com/spreadsheets/d/e/2PACX-1vTRuZcZySCSm6NRzXpXKbjp6KX5vWlqndQVNNYsmvpE9nJNpcYC9G-A8nt2BVhPdc8vzg6BRz2HuYyx/pubhtml# but the numbers were so high that we eventually got rid of it (besides the 儿童 section) because it just looked so scary to see 150k-300k.

 

Also, not many people will keep an accurate list of their known words (they need some sort of tool for that, which not everyone has), and most people will probably go by the number of words they have in their flashcard deck, which doesn't reflect the actual number of words they know, as many words are learnt naturally from consuming content (or they are variations like 第一次, 第二次, 第三次, 星期一, 星期二).

It's why we settled on character count as people can use tools to figure out a super rough estimate on how many characters they know. 

This is an issue we've pondered for a while, and we have gone back and forth on it. Maybe a written description of what it means to be at each level, and the sort of content someone at that level can comfortably consume?

At the moment, the resources that sit in each section are placed there based on experience from myself and a few others. 


On 1/25/2022 at 10:39 PM, alantin said:

And then analyze each text and calculate the weighted average frequency of all the words in the text. The end result would be a score that tells you the relative "easiness" of the text based on the word frequencies. The more frequently used words there are in the text, the greater the score and, in theory, the easier the text.

Too complicated if you ask me. The Flesch readability score for English takes only three easy-to-obtain numbers: total sentences, total words, total syllables.
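For reference, the standard Flesch Reading Ease formula for English can be written in a few lines. The function name is just for illustration, and the constants are calibrated for English, so this is not something to apply to Chinese text as-is.

```python
def flesch_reading_ease(total_sentences, total_words, total_syllables):
    """Flesch Reading Ease for English: higher scores mean easier text."""
    if total_sentences <= 0 or total_words <= 0:
        raise ValueError("need at least one sentence and one word")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# One sentence of ten one-syllable words.
print(flesch_reading_ease(1, 10, 10))
```

Longer sentences and longer words both pull the score down.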

 

As a side note, a high number of one-syllable words as well as four-syllable words may indicate an archaic, semi-classical style, so word length won't work in Chinese as it does in English.

 

If only someone could develop a method, or better still, build a website, that works for Chinese!

 

On 1/25/2022 at 8:30 PM, MoonIvy said:
On 1/25/2022 at 8:25 PM, Jan Finster said:

I would however not rank the levels according to known character count, but word count.

(If someone studied 3000 characters by Heisig method and/or Anki and was then asked to read a novel or news site, it would not be pretty at all.)


Ranking by character count isn't the best way, but we can't think of another, as not everyone follows HSK, and labels like "beginner" and "intermediate" are meaningless as well. If someone does decide to study only 3000 characters and nothing else, then thinks they can read a novel, well ...... it's probably a good wake-up call when they realise that's impossible.

Yeah, I reached the same conclusion when I started the short-lived First Chapter Project. There's not only the problem of word segmentation, there's also the fact that the Chinese school system actually uses character counts as goals. Of course native kids aren't learning characters the Heisig way. They're asked to use the character they learned in a word (组词) and use the word in a sentence (造句), not to mention copying out the character by hand - all the things an adult learner hates. So if you have learned characters properly, the number of known characters is a good indicator of your literacy level.


On 1/25/2022 at 6:50 PM, phills said:

This is very similar to the concept of information "entropy" or "surprisal".  Basically you measure how surprised you are to find a character/word of a certain frequency in the book.  If words that are supposed to be common turn up in a book, it's not much of a surprise.  If they're supposed to be rare but appear often, it's a "surprise."  Mathematically, it's measured by using the negative log of the expected frequency.

 

Wow! I didn't know that concept when I wanted to compare the difficulty of the chapters in my books and came up with that weighted average. I personally do it at character level though because there are a lot less of them to compare. I'll have to look into surprisal and change my measures!
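A small sketch of what a surprisal-based measure might look like at character level, assuming a dict of expected relative frequencies is available. The `floor` value for characters missing from the list is an arbitrary assumption, and the toy frequencies are made up.

```python
import math


def mean_surprisal(units, expected_freq, floor=1e-9):
    """Average of -log2(p) over the units (characters or words) of a text.

    p is the expected relative frequency of each unit. Units missing from
    the list are given the tiny probability `floor` (an assumed value).
    """
    if not units:
        return 0.0
    total = 0.0
    for unit in units:
        p = max(expected_freq.get(unit, 0.0), floor)
        total += -math.log2(p)
    return total / len(units)


# Made-up frequencies, for illustration only.
toy_freq = {"的": 0.5, "龘": 0.001}
print(mean_surprisal("的的", toy_freq))
print(mean_surprisal("龘", toy_freq) > mean_surprisal("的", toy_freq))
```

Passing a string scores it character by character; passing a list of segmented words scores it word by word. Unlike the weighted average, here a higher number means a harder text.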


On 1/25/2022 at 4:43 PM, MoonIvy said:


We had previously analysed many native webnovels, and we found the word count to be extremely high (150k-300k); even children's books for 6-7 year olds were around 5-8k. If we were to change the ranks, it'd be something along the lines of 5,000, 50,000, 100,000, 200,000+. They're really high numbers, and I feel that'll just scare people.

 

I am not sure how you get to 150-300K individual words (???) (the total word count is not relevant).

 

I could imagine you could go in steps of 2.5K, 5K, 10K, 15K, 20K and 25K+ unique words for the levels. This is of course not an exact science, but numbers like these come up on this forum again and again (e.g. https://www.chinese-forums.com/forums/topic/61248-reading-material-chasm/?do=findComment&comment=480572). I believe Imron also said something along the lines of 5K to start, 10K for easy novels and 20K+ for pretty much the rest (I hope I am remembering this correctly; otherwise I apologise for misquoting).
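As a sketch of how such steps could be applied, the snippet below maps a known-word count onto the highest step it reaches. The thresholds are the ones suggested above; the label strings and the function name `level_for` are assumptions for illustration.

```python
from bisect import bisect_right

# Thresholds from the suggestion above; labels are hypothetical.
LEVEL_BOUNDS = [2_500, 5_000, 10_000, 15_000, 20_000, 25_000]
LEVEL_LABELS = ["under 2.5K", "2.5K", "5K", "10K", "15K", "20K", "25K+"]


def level_for(known_unique_words):
    """Return the label of the highest threshold the count has reached."""
    return LEVEL_LABELS[bisect_right(LEVEL_BOUNDS, known_unique_words)]


for count in (1_200, 5_000, 12_000, 30_000):
    print(count, level_for(count))
```

The same lookup would work for the unique-word count of a book, if texts rather than readers are being ranked.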

 

On 1/25/2022 at 4:43 PM, MoonIvy said:

When analysing Chinese text, every variation counts as a word. For example, given 第一次, 第二次, 第三次, 一棵树, 两棵树, a computer would count all of these as different words; however, as a learner, you don't need to know them all individually. Another example is 走, 走着, 走着走着: a computer would count all of these as different words.

 I know, but does this really matter? Such words might constitute less than 5% of all unique words and they should average themselves out across the levels as you learn more words. In other words, you acknowledge them as a source of error, but this error is similar across all levels (maybe a bit higher at the 0-5K word level).

 

On 1/25/2022 at 5:09 PM, Publius said:

there's also the fact that the Chinese school system actually uses character counts as goals. Of course native kids aren't learning characters the Heisig way. They're asked to use the character they learned in a word (组词) and use the word in a sentence (造句), not to mention copying out the character by hand - all the things an adult learner hates. So if you have learned characters properly, the number of known characters is a good indicator of your literacy level.

Yes, in that particular setting, I believe character count makes sense. For Chinese as a second language learners, it probably does not.

 

On 1/25/2022 at 5:09 PM, Publius said:

The Flesch readability score for English takes only three easy-to-obtain numbers: total sentences, total words, total syllables.

I wonder if sentence length also plays a role in Chinese?

 

(In German it certainly does and we are famous for creating those long-winding and somewhat confusing word strings with nested subclauses, but not only that, even nested subclauses within nested subclauses, so that at the end of a sentence you do not really know how it started, but, if you do, you can consider yourself equal to the author Thomas Mann, who was famous for creating such long sentences and whose books challenge the minds of the TikTok generation...  (you get the gist)?)

 


The Chinese also seem to enjoy writing pagefuls of text using only commas as punctuation....
But it's not the same as German. In Chinese, the commas just seem to indicate that what's being said next is still related to what was said before the comma.


On 1/25/2022 at 7:33 PM, Jan Finster said:

I am not sure how you get to 150-300K individual words (???) (the total word count is not relevant).

Oh sorry, my bad! Was looking at the wrong data. Native content is around 15k-30k, so your suggestion could work. Do you know of a tool that people can use to work out roughly how many words they know? 

 

On 1/25/2022 at 4:50 PM, phills said:

I didn't feel comfortable starting reading native materials until 2500, and didn't feel like I was confident until >3500.

What sort of content did you read? Myself and members of our Discord have been reading Chinese webnovels since around 2k characters. We've found some relatively easy, modern, slice of life webnovels. 


On 1/25/2022 at 9:19 PM, MoonIvy said:

Do you know of a tool that people can use to work out roughly how many words they know? 

 

Chinese Text Analyser (by Imron (one of the mods)) would be the obvious suggestion. I guess anyone who is serious enough about Chinese to learn 5K+ words can spend $15 on this great tool.


On 1/25/2022 at 3:43 PM, MoonIvy said:

At the moment, the resources that sit in each section are placed there based on experience from myself and a few others. 

 

Perfect! If there was a useful way to quantify difficulty precisely, you or someone else would have done it already. So just go with your judgement and your experience, and perhaps explain your reasoning - but don't get sucked into too much left-brain analysis. People who are really axed to do the analysis will have their own preferred tools and parameters.

 

On 1/25/2022 at 8:43 PM, Jan Finster said:

Chinese Text Analyser (by Imron (one of the mods)) would be the obvious suggestion

 

Agreed.


welp. i just followed your desktop OCR guide, and wanna say. THANK YOU!!!

 

I've been searching for a few weeks for something like pleco ocr and actually used the pleco live feature to read characters on my PC.... 

This seems especially useful for games, an area where my knowledge of vocab is especially poor


On 1/25/2022 at 8:50 PM, realmayo said:

Perfect! If there was a useful way to quantify difficulty precisely, you or someone else would have done it already. So just go with your judgement and your experience, and perhaps explain your reasoning - but don't get sucked into too much left-brain analysis. People who are really axed to do the analysis will have their own preferred tools and parameters.

Yeah, it is hard to divide up levels; even within a level, each person's ability to comprehend the content will vary.


On 1/26/2022 at 3:33 AM, Jan Finster said:

Yes, in that particular setting, I believe character count makes sense. For Chinese as a second language learners, it probably does not.

Yet they aim to one day read like a native. When in Rome... use hand gestures, is all I'm saying.

 

On 1/26/2022 at 3:33 AM, Jan Finster said:

we are famous for creating those long-winding and somewhat confusing word strings with nested subclauses, but not only that, even nested subclauses within nested subclauses, so that at the end of a sentence you do not really know how it started

Mark Twain summed it up pretty neatly. My first encounter with the German language was in the late 80s, when the West German embassy used to give out free textbooks (Auf Deutsch Gesagt, Familie Baumann) if asked politely. When I ventured into the literary world, however, I was like, wtf, don't shoot! I surrender! (Young Werther was the culprit, I believe.)


On 1/26/2022 at 4:19 AM, MoonIvy said:

What sort of content did you read? Myself and members of our Discord have been reading Chinese webnovels since around 2k characters. We've found some relatively easy, modern, slice of life webnovels. 

 

I was halfway through the HSK6 character list (which goes up to ~2700), when I started reading 活着.  I was somewhere in the low/mid 2000s.  I decided to pre-memorize all the unknown chars in the book (about ~250), before I started, which took me up to around 2500.

 

Then I tried reading other books, and I decided to pre-memorize the ~100-200 unknown chars in each book before I started.  Basically I did this until I got to about 3500 chars (~4 books), then it wasn't worth the effort anymore.

 

I never tried web-novels.  Also, perhaps you could know fewer chars if you can tolerate more ambiguity.  Even after 4 books, I could still find ~100 unknown chars in each new book, but I decided to just figure them out in context as they came along.  But to get a comfortable cushion of characters, suitable for lots of different texts, I think you need 2500-3500.
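The pre-memorizing step can be sketched roughly as below: given the set of characters you already know, list the unknown ones in a book and estimate how much of the running text is covered. The helper names are hypothetical, and the Han check only looks at the basic CJK Unified Ideographs block, which is an assumption that ignores rarer extension characters.

```python
def is_han(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def unknown_characters(text, known):
    """Distinct Han characters in `text` that are not in `known`."""
    seen = {ch for ch in text if is_han(ch)}
    return seen - set(known)


def character_coverage(text, known):
    """Share of running Han characters in `text` that are in `known`."""
    han = [ch for ch in text if is_han(ch)]
    if not han:
        return 1.0
    known = set(known)
    return sum(ch in known for ch in han) / len(han)


sample = "我爱读书,读书!"
print(unknown_characters(sample, "我爱读"))
print(character_coverage(sample, "我爱读"))
```

The size of the unknown set is the list to pre-memorize; the coverage figure gives a feel for how often reading would be interrupted.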

 

At least for me, I get frustrated when people under-estimate the number of characters you need to know.  I suppose they want to make reading seem more accessible.  But the fewer characters you know, the harder the slog when you first read.  If you've already memorized 2000, another 500 isn't too much more.  I'd rather know more chars ahead of time, and have an easier time reading, than interrupt my reading constantly with a dictionary.  That could just be my subjective preference though (I prefer extensive > intensive reading).  Some people are more tenacious about getting through the slog.

 

I suppose you talk about the same thing in your "Reading Pain" section.  Your 2000/3000 numbers are set at the Reading Pain level rather than Intensive or Extensive Reading level :) I'd boost it by 500 if you want less Reading Pain when you start.

 

On 1/26/2022 at 4:19 AM, MoonIvy said:

Do you know of a tool that people can use to work out roughly how many words they know? 


http://www.zhtoolkit.com/posts/tools/  does that, but strangely it seems busted now.  I tried it a few months ago, and it worked.  Maybe there's another site hosting it, or it'll recover by itself when they notice the bug.

 

Here's a nice graph from that site, summarizing test results:

 

http://www.zhtoolkit.com/posts/2011/06/skill-levels-quantified/

[Image: graph-estknown-vs-skill.png]
