How to measure the difficulty of a text?


I want to find some way to measure the difficulty level of some Chinese eBooks and arrange them from easiest to most difficult.

Here is what I tried:

  1. I ran the text through word-segmentation software to identify each unique word.
  2. In the text, I replaced all HSK 1 words with "1", all HSK 2 words with "2", etc., and all words not in the HSK list with "7".
  3. I removed all remaining symbols. This resulted in a document containing only numbers, from 1 to 7.
  4. I then calculated the mean average of all of the numbers in the document.
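
As a rough illustration, steps 2 to 4 could be sketched in Python like this. The tiny `HSK_LEVEL` dictionary and the token list are made up for the example (the levels are illustrative, not checked against the official lists); a real run would load the full HSK lists and take its tokens from the segmenter's output.

```python
from statistics import mean

# Hypothetical lookup: word -> HSK level. Illustrative values only.
HSK_LEVEL = {"我": 1, "喜欢": 1, "狗": 1, "因为": 2, "可爱": 3}
UNLISTED = 7  # score given to any word that is not in the HSK lists

def mean_hsk_level(tokens, levels=HSK_LEVEL):
    """Average the per-token HSK level, scoring unlisted words as 7."""
    return mean(levels.get(tok, UNLISTED) for tok in tokens)

# Tokens as they might come out of a segmenter, punctuation already removed.
tokens = ["我", "喜欢", "狗", "因为", "狗", "可爱", "忠诚"]
score = mean_hsk_level(tokens)
print(round(score, 2))
```

Every occurrence of a word counts, so very frequent low-level words pull the average of any text toward the same middle value.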

I thought this would give a rough estimate of the difficulty level. However, the calculations all came out very close to 3.5. There was little difference in the numbers between a children's story that I can easily read and a challenging novel.

Is there some way to improve this? Or a better process for determining the relative difficulty?


It looks like you won't get as fine an ordering as you aimed for, so some coarser measure, which should be easier to calculate, should be good enough. If you feed Chinese Text Analyser, recommended above, a list of the words you already know, and have it analyse the text, it will give you a word count and a list of the words you don't know. Then a simple measurement of difficulty, "% unknown words", sounds pretty good.
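
If you would rather compute that figure yourself, a minimal sketch follows. It is not how Chinese Text Analyser works internally; `percent_unknown`, the known-word set and the token list are all hypothetical, and the tokens are assumed to come from whatever segmenter you use.

```python
def percent_unknown(tokens, known_words):
    """Return (% of running words not in known_words, sorted distinct unknown words)."""
    if not tokens:
        return 0.0, []
    unknown = [t for t in tokens if t not in known_words]
    return 100.0 * len(unknown) / len(tokens), sorted(set(unknown))

# Made-up data for illustration only.
known = {"我", "喜欢", "狗", "因为"}
tokens = ["我", "喜欢", "狗", "因为", "狗", "可爱", "忠诚", "可爱"]

pct, unknown_words = percent_unknown(tokens, known)
print(pct, unknown_words)
```

Sorting books by `pct` then gives a coarse easiest-to-hardest ordering for one particular reader.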


Hey, don't waste too much time ordering the texts. :-)


I think your step 4 is not strict enough. I would test different algorithms for step 4 until things line up with your expectations.

An even better idea however would be to just leave the overanalysing alone. Reading the first page of a book and judging its difficulty by just feeling it out is the way I deem most fitting.


Look also at sentence / clause length. Total vocabulary items (your approach will rate a text containing every HSK4 word as just as hard as a text with ten HSK4 words repeated over and over). Word length. Paragraph length. Range of structures.
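
A few of these surface measures are easy to compute once the text is segmented. The sketch below is only one possible reading of them: `surface_stats` is a hypothetical helper, sentences are split on sentence-final punctuation only, and both the text and the token list are assumed to be non-empty.

```python
import re

# Chinese and Western sentence-final punctuation.
SENTENCE_END = re.compile(r"[。！？!?]+")

def surface_stats(text, tokens):
    """Simple surface measures; assumes non-empty text and tokens."""
    sentences = [s.strip() for s in SENTENCE_END.split(text) if s.strip()]
    distinct = set(tokens)
    return {
        "mean_sentence_chars": sum(len(s) for s in sentences) / len(sentences),
        "distinct_words": len(distinct),
        "type_token_ratio": len(distinct) / len(tokens),
        "mean_word_chars": sum(len(t) for t in tokens) / len(tokens),
    }

# Made-up sample; tokens would normally come from a segmenter.
text = "我喜欢狗。狗很可爱！"
tokens = ["我", "喜欢", "狗", "狗", "很", "可爱"]
stats = surface_stats(text, tokens)
print(stats)
```

Paragraph length and range of structures are left out here, since they need more than punctuation splitting to measure sensibly.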


The other thing that Chinese Text Analyser does is keep track of your known vocabulary.


It's all very well to sort by HSK lists, but that might not be a reflection of difficulty for you - especially as you start reading native content and your vocab starts to diverge from what you might find in the HSK lists.


One of the design goals of CTA was to allow you to have a reasonably fast and accurate way to assess whether a given text is appropriate for your level.


By keeping track of your known vocabulary, CTA can very quickly (e.g. less than a second for most books) give you an approximate number of unknown words in the text (I say approximate, because minor errors with the segmentation will throw off the count slightly).


You can also use it to export lists of unknown words (sorted by frequency and/or first occurrence), which can then be used to study words in advance of reading.


*disclaimer, I'm the developer of Chinese Text Analyser


I don't think the new HSK lists are appropriate for the process described in the OP.

The lower levels contain too few words to be useful with native content. That's why it averages at 3.5.

But maybe they could work better with the various levels of the Chinese Breeze series?

You'd need different lists for native content I think. Perhaps try the old HSK lists.

Edit: the new HSK lists don't have words such as 星期天! Useless.


My approach is the number of distinct (unknown) vocabulary items relative to the length of the text. This gives a very inaccurate result, but there is some correlation with difficulty. I think that if you combine it with sentence/clause length, as Roddy suggests, you should get pretty decent results.
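
In code, that measure could look something like the sketch below. The function name, the sample data and the choice to scale by a fixed number of running words are my own assumptions for illustration.

```python
def distinct_unknown_density(tokens, known_words, per=1000):
    """Distinct unknown words per `per` running words of text."""
    if not tokens:
        return 0.0
    distinct_unknown = {t for t in tokens if t not in known_words}
    return per * len(distinct_unknown) / len(tokens)

# Made-up data; "可爱" repeats but is only counted once.
known = {"我", "喜欢", "狗", "因为"}
tokens = ["我", "喜欢", "狗", "因为", "狗", "可爱", "忠诚", "可爱"]
density = distinct_unknown_density(tokens, known, per=100)
print(density)
```

Unlike a plain percentage of unknown running words, a repeated unknown word only counts once here, so a text that leans on a handful of new words is not penalised for every repetition.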


It's not an exact science, however. Especially when sentence length and unknown vocabulary are unevenly distributed, reality may be quite different. Generally, for example, dialogue is relatively easy while narrative is often a bit harder. Specialist subjects may contain a relatively small amount of rare, subject-specific vocabulary, etc.


I've made attempts at a worthwhile method, but my best result is a very rough approximation. The biggest issue is that while it does come down to unknown words, sometimes those words are unimportant fluff, and sometimes a single word is the one critical item that is the key to understanding a whole paragraph.

The OP formula isn't a bad try. You would have gotten better granularity using 10^level rather than just the raw HSK level, since the levels are for different windows of word frequency, which decrease exponentially. But the biggest problem is that many words aren't in the HSK so they can't be ranked.
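
To see what the exponential weighting does, here is a small sketch comparing the raw mean with the mean of 10^level. The two level sequences are invented, and `raw_mean` and `exp_mean` are just names for the illustration.

```python
def raw_mean(levels):
    """Plain average of HSK levels, as in the OP's step 4."""
    return sum(levels) / len(levels)

def exp_mean(levels, base=10):
    """Average of base**level, so each higher level weighs ten times more."""
    return sum(base ** lv for lv in levels) / len(levels)

# Invented per-token levels: identical except for one HSK 2 vs HSK 5 word.
easy = [1, 1, 2, 1]
hard = [1, 1, 5, 1]

print(raw_mean(easy), raw_mean(hard))
print(exp_mean(easy), exp_mean(hard))
```

The raw means differ by less than one level, while the exponential means differ by several orders of magnitude. The catch is the same one as before: any word scored as "not in HSK" would dominate the exponential average.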

In the US, various algorithms are used to grade reading levels. One common one is Lexiles. According to The Origin of the Lexile Specification Equation, the formula is:

Theoretical Logit = (9.82247*LMSL)-(2.14634*MLWF)

where LMSL = log of the mean sentence length and MLWF = mean of the log word frequencies. LMSL and MLWF are used as proxies for syntactic complexity and semantic demand. (Stenner & Burdick, 1997).

This raw score gets plugged into a normalization formula to get the official Lexile score of 0 to 2000+. So, it's basically a weighted combination of sentence length and word frequency.
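
For anyone who wants to play with the raw formula, a sketch is below. Several things in it are my assumptions rather than anything stated above: natural logs, sentence length counted in words, and word frequencies given as positive counts from whatever corpus you have. The normalization step to the official Lexile scale is not included.

```python
import math

def theoretical_logit(sentence_lengths, word_frequencies):
    """Raw score from the formula quoted above (assumes natural logs)."""
    # LMSL: log of the mean sentence length (assumed to be in words).
    lmsl = math.log(sum(sentence_lengths) / len(sentence_lengths))
    # MLWF: mean of the log word frequencies (assumed positive corpus counts).
    mlwf = sum(math.log(f) for f in word_frequencies) / len(word_frequencies)
    return 9.82247 * lmsl - 2.14634 * mlwf

# Invented inputs: two 10-word sentences, all words with corpus frequency 100.
baseline = theoretical_logit([10, 10], [100, 100])
longer_sentences = theoretical_logit([20, 20], [100, 100])
rarer_words = theoretical_logit([10, 10], [10, 10])
print(baseline, longer_sentences, rarer_words)
```

Longer sentences and rarer words both push the score up, which is the behaviour you would want from a difficulty measure.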

Flesch–Kincaid is another well-known one. However, it includes number of syllables as a proxy for word difficulty, which may not apply so well to Chinese. Of the methods listed in Wikipedia, most of them have number of syllables or letters in a word as a factor.

I will throw in one final reference, which is from Donald Hayes, who has come up with his own measures. I have yet to find a formula for his "LEX" computed readability. However, a related measure is his "MeanU" number, which is the average word frequency of all words in the text. This is an easy number to calculate, if you are using text analysis software that will list the corpus frequencies for every word.

