How to measure the difficulty of a text?



I want to find some way to measure the difficulty level of some Chinese eBooks and arrange them from easiest to most difficult.

Here is what I tried:

  1. I ran the text through word-segmentation software to identify each unique word.
  2. In the text, I replaced all HSK 1 words with "1", all HSK 2 words with "2", etc., and all words not in the HSK list with "7".
  3. I removed all remaining symbols. This resulted in a document containing only numbers, from 1 to 7.
  4. I then calculated the mean average of all of the numbers in the document.

I thought this would give a rough estimate of the difficulty level. However, the calculations all came out very close to 3.5. There was little difference in the numbers between a children's story that I can easily read and a challenging novel.
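For anyone who wants to reproduce this, steps 2 to 4 amount to roughly the following. This is only a minimal sketch: the tiny `HSK_LEVEL` table is a made-up stand-in for the real HSK lists, and the text is assumed to be already segmented into words with punctuation removed.

```python
from statistics import mean

# Hypothetical miniature word list: word -> HSK level.
# A real run would load the full HSK lists instead.
HSK_LEVEL = {"我": 1, "喜欢": 1, "因为": 2, "环境": 3, "经济": 4}

UNLISTED = 7  # score for words that are not in any HSK list


def mean_hsk_level(words):
    """Average level over already-segmented words (steps 2 to 4)."""
    return mean(HSK_LEVEL.get(w, UNLISTED) for w in words)


# Segmentation (step 1) is assumed to have been done by another tool.
words = ["我", "喜欢", "环境", "经济", "蝴蝶"]
print(mean_hsk_level(words))  # (1 + 1 + 3 + 4 + 7) / 5
```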

Is there some way to improve this? Or a better process for determining the relative difficulty?


It looks like you won't get as fine an ordering as you aimed for, so some coarser measure, which should be easier to calculate, should be good enough. If you feed Chinese Text Analyser, recommended above, a list of the words you already know, and have it analyse the text, it will give you a word count and a list of the words you don't know. Then a simple measurement of difficulty, "% unknown words" sounds pretty good.
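The "% unknown words" measure is easy to compute yourself once a text is segmented. A rough sketch, assuming you already have each book as a list of words and your known vocabulary as a set (the book titles and word lists below are invented for illustration, and this is not how Chinese Text Analyser itself is implemented):

```python
def percent_unknown(words, known):
    """Share of running words (tokens) that are not in the known set."""
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in known)
    return 100.0 * unknown / len(words)


# Invented sample data: a known-word set and three tiny "books".
known = {"我", "喜欢", "看", "书"}
books = {
    "story": ["我", "喜欢", "看", "书"],
    "novel": ["我", "喜欢", "看", "小说", "我", "喜欢", "看", "书"],
    "essay": ["经济", "环境", "我", "看"],
}

# Order the books from the lowest to the highest share of unknown words.
ranking = sorted(books, key=lambda title: percent_unknown(books[title], known))
print(ranking)
```

Sorting by that one number gives the coarse easiest-to-hardest ordering without any HSK lists.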


Hey, don't waste too much time ordering the texts. :-)


I think your step 4 is not strict enough. I would test different algorithms for step 4 until things line up with your expectations.

An even better idea however would be to just leave the overanalysing alone. Reading the first page of a book and judging its difficulty by just feeling it out is the way I deem most fitting.


The other thing that Chinese Text Analyser does is keep track of your known vocabulary.


It's all very well to sort by HSK lists, but that might not be a reflection of difficulty for you - especially as you start reading native content and your vocab starts to diverge from what you might find in the HSK lists.


One of the design goals of CTA was to allow you to have a reasonably fast and accurate way to assess whether a given text is appropriate for your level.


By keeping track of your known vocabulary, CTA can very quickly (e.g. less than a second for most books) give you an approximate number of unknown words in the text (I say approximate, because minor errors with the segmentation will throw off the count slightly).


You can also use it to export lists of unknown words (sorted by frequency and/or first occurrence), which can then be used to study words in advance of reading.
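To illustrate what such an export looks like, here is a small sketch of the two orderings. It is not CTA's actual code, just one way to get the same kind of list from a segmented text; the function name and sample words are made up.

```python
from collections import Counter


def unknown_words(words, known):
    """Return unknown words ordered by frequency and by first occurrence."""
    counts = Counter(w for w in words if w not in known)
    first_seen = {}
    for position, w in enumerate(words):
        if w not in known and w not in first_seen:
            first_seen[w] = position
    # Most frequent first; ties are broken by where the word first appears.
    by_frequency = sorted(counts, key=lambda w: (-counts[w], first_seen[w]))
    by_first_occurrence = sorted(counts, key=first_seen.get)
    return by_frequency, by_first_occurrence


known = {"他", "了"}
words = ["价格", "他", "谈判", "了", "谈判", "成功", "谈判"]
by_frequency, by_first_occurrence = unknown_words(words, known)
print(by_frequency)
print(by_first_occurrence)
```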


*disclaimer, I'm the developer of Chinese Text Analyser


I don't think the new HSK lists are appropriate for the process described in the op.

The lower levels contain too few words to be useful with native content; that's why it averages out at 3.5.

But maybe they could work better with the various levels of the Chinese Breeze series?

You'd need different lists for native content I think. Perhaps try the old HSK lists.

Edit: the new HSK lists don't even have words such as 星期天! Useless.


My approach is the number of distinct (unknown) vocabulary items relative to the length of the text. This gives a very inaccurate result, but there is some correlation with difficulty. I think that if you combine it with sentence/clause length, as Roddy suggests, you should get pretty decent results.


It's not an exact science, however. Especially when sentence length and unknown vocabulary are unevenly distributed, reality may be quite different. For example, dialogue is generally relatively easy, while narrative is often a bit harder. Specialist subjects may contain a relatively small amount of rare, subject-specific vocabulary, etc.
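Both numbers are simple to compute. A sketch under some assumptions of mine: clause length is counted in characters, clauses are split on common Chinese and ASCII punctuation, and the vocabulary figure is scaled to "per 1000 words". How to weight the two against each other is left open.

```python
import re

# Punctuation treated as clause boundaries (an assumption, adjust to taste).
CLAUSE_BREAKS = r"[，。！？；：,.!?;:]"


def mean_clause_length(text):
    """Average clause length in characters."""
    clauses = [c.strip() for c in re.split(CLAUSE_BREAKS, text)]
    clauses = [c for c in clauses if c]
    if not clauses:
        return 0.0
    return sum(len(c) for c in clauses) / len(clauses)


def distinct_unknown_per_1000(words, known):
    """Distinct unknown words per 1000 running words."""
    if not words:
        return 0.0
    distinct = {w for w in words if w not in known}
    return 1000.0 * len(distinct) / len(words)


text = "我喜欢看书，也喜欢看电影。你呢？"
words = ["我", "喜欢", "看", "书", "也", "喜欢", "看", "电影", "你", "呢"]
known = {"我", "喜欢", "看", "书", "你", "呢"}
print(mean_clause_length(text))
print(distinct_unknown_per_1000(words, known))
```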


I've made attempts at a worthwhile method, but my best result is a very rough approximation. The biggest issue is that while it does come down to unknown words, sometimes those words are unimportant fluff, and sometimes a single word is the one critical item that is the key to understanding a whole paragraph.

The OP formula isn't a bad try. You would have gotten better granularity using 10^level rather than just the raw HSK level, since the levels are for different windows of word frequency, which decrease exponentially. But the biggest problem is that many words aren't in the HSK so they can't be ranked.
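To show what the 10^level weighting does, here is a small sketch with two invented level distributions of 100 words each. The only difference is that text B has twice as many words outside the HSK lists (scored 7, as in the OP's scheme).

```python
def mean_score(levels, weight):
    """Average of weight(level) over all word levels in a text."""
    return sum(weight(level) for level in levels) / len(levels)


# Invented distributions: text B has 10 unlisted words, text A has 5.
text_a = [1] * 60 + [2] * 20 + [3] * 10 + [4] * 5 + [7] * 5
text_b = [1] * 55 + [2] * 20 + [3] * 10 + [4] * 5 + [7] * 10

raw_a = mean_score(text_a, lambda level: level)
raw_b = mean_score(text_b, lambda level: level)
exp_a = mean_score(text_a, lambda level: 10 ** level)
exp_b = mean_score(text_b, lambda level: 10 ** level)

print(raw_a, raw_b)   # raw means stay close together
print(exp_a, exp_b)   # exponential weights pull the texts apart
```

With the raw levels the two texts differ by about 16%, while with 10^level the second scores roughly twice the first, because the rare words dominate the average.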

In the US, various algorithms are used to grade reading levels. One common one is Lexiles. According to The Origin of the Lexile Specification Equation, the formula is:

Theoretical Logit = (9.82247*LMSL)-(2.14634*MLWF)

where LMSL = log of the mean sentence length and MLWF = mean of the log word frequencies. LMSL and MLWF are used as proxies for syntactic complexity and semantic demand. (Stenner & Burdick, 1997).

This raw score gets plugged into a normalization formula to get the official Lexile score of 0 to 2000+. So, it's basically a weighted combination of sentence length and word frequency.
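A sketch of the raw score as quoted above, with several assumptions on my part: natural logarithms, sentence length counted in words, and a frequency table that covers every word in the text. The frequencies below are invented, and the normalization step into an official Lexile score is not reproduced.

```python
import math


def theoretical_logit(sentences, word_freq):
    """Raw score from the formula quoted above.

    sentences: list of sentences, each a list of words.
    word_freq: word -> corpus frequency (must cover every word).
    """
    words = [w for sentence in sentences for w in sentence]
    # LMSL: log of the mean sentence length (in words).
    lmsl = math.log(len(words) / len(sentences))
    # MLWF: mean of the log word frequencies.
    mlwf = sum(math.log(word_freq[w]) for w in words) / len(words)
    return 9.82247 * lmsl - 2.14634 * mlwf


# Invented frequencies, purely for illustration.
freq = {"我": 5000, "喜欢": 800, "看": 1200, "书": 300}
sentences = [["我", "喜欢", "书"], ["我", "看", "书"]]
print(theoretical_logit(sentences, freq))
```

Longer sentences push the score up and more frequent words pull it down, which is the weighted combination described above.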

Flesch–Kincaid is another well-known one. However, it includes number of syllables as a proxy for word difficulty, which may not apply so well to Chinese. Of the methods listed in Wikipedia, most of them have number of syllables or letters in a word as a factor.

I will throw in one final reference, which is from Donald Hayes, who has come up with his own measures. I have yet to find a formula for his "LEX" computed readability. However, a related measure is his "MeanU" number, which is the average word frequency of all words in the text. This is an easy number to calculate, if you are using text analysis software that will list the corpus frequencies for every word.
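Going only by the description above (Hayes's published definition may differ in detail), the calculation would look something like this. Words missing from the frequency list are skipped, and the share of words that were found is reported alongside so you can see how much of the text the number really covers. The frequencies are invented.

```python
def mean_frequency(words, corpus_freq):
    """Average corpus frequency over the words found in the frequency list.

    Returns (mean frequency, share of running words that were found).
    """
    found = [corpus_freq[w] for w in words if w in corpus_freq]
    coverage = len(found) / len(words) if words else 0.0
    mean_u = sum(found) / len(found) if found else 0.0
    return mean_u, coverage


# Invented corpus frequencies, for illustration only.
corpus_freq = {"的": 50000, "书": 300}
words = ["的", "书", "的", "蝴蝶"]
mean_u, coverage = mean_frequency(words, corpus_freq)
print(mean_u, coverage)
```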


  • 5 years later...

I'm with Roddy on this.


On 10/4/2015 at 5:43 PM, roddy said:

Look also at sentence / clause length....


... ... etc., etc., etc. If only it was a question of vocabulary! 

Vocabulary is the easy part, with all the electronic aids at hand. The rest of it isn't so easy: one only really knows one can read a book when reading it. I have a heap of abandoned reading attempts, waiting for the right time, mood and circumstances. But it's not just unknown words. See here:






Can you understand this paragraph without reading it several times?

Known words? Probably a high % for intermediate students, 75 - 80%?  It's only the 2nd or 3rd paragraph in Chapter 1 of 《我城》by the Hong Kong writer 西西. She writes like a child, uses simple, common words, no allusions or literary convolutions. But, does she play with the language? 


Most e-book sites have free samples that one can read even without opening an account. These can be quite generous, quite enough to check whether the book is up to one's standard (or patience).




7 minutes ago, imron said:

All words.  It’s a measure of how much you’ll understand paragraph to paragraph when reading the text. 


Thank you! I could not really find the answer anywhere, but what you say makes a lot of sense.


Still, reading a book with 4% unknown words can boil down to 3000 new unique words 🙈


Here is one, I had in mind (沃顿商学院最受欢迎的谈判课):

    Total words        183.829
      Known            176.322
      Percent known    95,92%
      Unknown          7.507
      Percent unknown  4,08%
    Unique words       11.136
      Known            8.087
      Percent known    72,62%
      Unknown          3.049
      Percent unknown  27,38%

Piece of cake😅
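The gap between 4% and 27% is just the difference between counting running words and counting unique words. A quick check of the arithmetic, using the counts above (the "." in those figures is a thousands separator):

```python
# Counts taken from the statistics above.
total_words, unknown_words = 183_829, 7_507
unique_words, unique_unknown = 11_136, 3_049

pct_unknown_running = 100 * unknown_words / total_words
pct_unknown_unique = 100 * unique_unknown / unique_words
# How often each unknown word turns up on average.
repeats = unknown_words / unique_unknown

print(f"{pct_unknown_running:.2f}% of running words are unknown")
print(f"{pct_unknown_unique:.2f}% of unique words are unknown")
print(f"each unknown word appears about {repeats:.1f} times")
```

So most of the unknown words show up only a couple of times each, which is why a small share of the running text still adds up to about 3000 words to look up.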


On 10/4/2015 at 3:40 PM, HerrPetersen said:

An even better idea however would be to just leave the overanalysing alone. Reading the first page of a book and judging its difficulty by just feeling it out is the way I deem most fitting.

Keep this in mind.


Using CTA and counting words and whatnot is all well and good, but be careful that you don't spend more time analysing than you would by just picking up a text, starting to read it, and putting it aside if it proves too difficult. This analysing can be a version of the Textbook Pitfall, where a prospective learner keeps searching for the perfect textbook instead of just starting to study with any reasonably good one.

