Chinese-forums.com
Learn Chinese in China

Friday

How to measure the difficulty of a text?

Friday

I want to find some way to measure the difficulty level of some Chinese eBooks and arrange them from easiest to most difficult.

Here is what I tried:

  1. I ran the text through word-segmentation software to identify each unique word.
  2. In the text, I replaced all HSK 1 words with "1", all HSK 2 words with "2", etc., and all words not in the HSK list with "7".
  3. I removed all remaining symbols. This resulted in a document containing only numbers, from 1 to 7.
  4. I then calculated the mean of all the numbers in the document.

I thought this would give a rough estimate of the difficulty level. However, the calculations all came out very close to 3.5. There was little difference in the numbers between a children's story that I can easily read and a challenging novel.
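
For reference, the four steps above can be sketched in a few lines of Python. This is only an illustration: it assumes the text has already been segmented into a list of words, and the `HSK_LEVEL` table is a tiny made-up lookup whose levels have not been checked against the official lists.

```python
from statistics import mean

# Hypothetical miniature lookup: word -> HSK level (illustrative values only).
# A real run would load the full published HSK lists instead.
HSK_LEVEL = {"我": 1, "喜欢": 1, "看": 1, "书": 1, "因为": 2, "经验": 4}
NOT_IN_HSK = 7  # score given to any word that is not in the lists (step 2)


def mean_hsk_level(words):
    """Mean level over every word occurrence in an already-segmented text."""
    return mean(HSK_LEVEL.get(w, NOT_IN_HSK) for w in words)


# Toy word list standing in for the output of a word segmenter.
words = ["我", "喜欢", "看", "书", "因为", "书", "经验", "丰富"]
print(mean_hsk_level(words))  # levels 1,1,1,1,2,1,4,7 -> 2.25
```

Because every occurrence counts equally, the many repeated low-level words and the flat score of 7 for everything else tend to pull different texts toward similar averages.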
 

Is there some way to improve this? Or a better process for determining the relative difficulty?


querido

It looks like you won't get as fine an ordering as you aimed for, so some coarser measure, which should be easier to calculate, should be good enough. If you feed Chinese Text Analyser, recommended above, a list of the words you already know, and have it analyse the text, it will give you a word count and a list of the words you don't know. Then a simple measurement of difficulty, "% unknown words" sounds pretty good.
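
A rough sketch of the "% unknown words" idea, for anyone who wants to try it by hand. It is not how Chinese Text Analyser works internally; it simply assumes you already have a segmented word list and a set of known words (both names below are hypothetical inputs).

```python
from collections import Counter


def unknown_report(words, known):
    """Return (percent of word occurrences that are unknown,
    unknown words sorted from most to least frequent)."""
    counts = Counter(words)
    unknown = Counter({w: n for w, n in counts.items() if w not in known})
    total = sum(counts.values())
    pct = 100 * sum(unknown.values()) / total if total else 0.0
    return pct, unknown.most_common()


words = ["他", "在", "图书馆", "看", "书", "他", "喜欢", "图书馆"]
known = {"他", "在", "看", "书", "喜欢"}
pct, todo = unknown_report(words, known)
print(f"{pct:.1f}% unknown", todo)  # 2 of 8 occurrences are unknown
```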

 

Hey, don't waste too much time ordering the texts. :-)

HerrPetersen

I think your step 4 is not strict enough. I would test different algorithms for step 4 until things line up with your expectations.

An even better idea however would be to just leave the overanalysing alone. Reading the first page of a book and judging its difficulty by just feeling it out is the way I deem most fitting.

roddy

Look also at sentence / clause length. Total vocabulary items (your approach will rate a text with every HSK4 word in as equally hard as a text with ten HSK4 words repeated over and over). Word length. Paragraph length. Range of structures.
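
Some of these surface measures are easy to compute. The sketch below is one possible way to do it, assuming non-empty input, a pre-segmented word list, and a simple punctuation-based split for sentences and clauses; the function name and the choice of punctuation marks are assumptions, and paragraph length and range of structures are left out.

```python
import re


def surface_features(text, words):
    """Rough surface measures for a text and its segmented word list."""
    # Sentences end at full stops, question marks and exclamation marks.
    sentences = [s for s in re.split(r"[。！？!?]", text) if s.strip()]
    # Clauses additionally break at commas, semicolons and colons.
    clauses = [c for c in re.split(r"[。！？!?，,；;：:]", text) if c.strip()]
    return {
        "mean_sentence_chars": sum(len(s) for s in sentences) / len(sentences),
        "mean_clause_chars": sum(len(c) for c in clauses) / len(clauses),
        "distinct_words": len(set(words)),
        "mean_word_chars": sum(len(w) for w in words) / len(words),
    }


text = "别的房子高，它矮；别的房子瘦，它胖。"
words = ["别的", "房子", "高", "它", "矮", "别的", "房子", "瘦", "它", "胖"]
features = surface_features(text, words)
print(features)
```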

imron

The other thing that Chinese Text Analyser does is keep track of your known vocabulary.

 

It's all very well to sort by HSK lists, but that might not be a reflection of difficulty for you - especially as you start reading native content and your vocab starts to diverge from what you might find in the HSK lists.

 

One of the design goals of CTA was to allow you to have a reasonably fast and accurate way to assess whether a given text is appropriate for your level.

 

By keeping track of your known vocabulary, CTA can very quickly (e.g. less than a second for most books) give you an approximate number of unknown words in the text (I say approximate, because minor errors with the segmentation will throw off the count slightly).

 

You can also use it to export lists of unknown words (sorted by frequency and/or first occurrence), which can then be used to study words in advance of reading.

 

*disclaimer, I'm the developer of Chinese Text Analyser

edelweis

I don't think the new HSK lists are appropriate for the process described in the OP.

The lower levels contain too few words to be useful with native content; that's why it averages out at 3.5.

But maybe they could work better with the various levels of the Chinese Breeze series?

You'd need different lists for native content I think. Perhaps try the old HSK lists.

edit: the new hsk lists don't have words such as 星期天! useless.

Silent

My approach is the number of distinct (unknown) vocabulary items relative to the length of the text. This gives a very inaccurate result, but there is some correlation with difficulty. I think that if you combine it with sentence/clause length, as Roddy suggests, you should get pretty decent results.
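
As a minimal sketch of that measure, assuming a segmented word list and a set of known words (the function name and the per-1000-words scaling are just one possible choice):

```python
def distinct_unknown_density(words, known, per=1000):
    """Distinct unknown words per `per` running words of text."""
    if not words:
        return 0.0
    distinct_unknown = {w for w in words if w not in known}
    return per * len(distinct_unknown) / len(words)


words = ["他", "在", "图书馆", "看", "书", "他", "去", "图书馆", "借", "书"]
known = {"他", "在", "看", "书", "去"}
# Two distinct unknown words (图书馆, 借) in ten running words.
print(distinct_unknown_density(words, known, per=100))
```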

 

It's not an exact science, however. Especially when sentence length and unknown vocabulary are unevenly distributed, reality may be quite different. For example, dialogue is generally relatively easy, while narrative is often a bit harder. Specialist subjects may contain a relatively small amount of rare, subject-specific vocabulary, etc.

c_redman

I've made attempts at a worthwhile method, but my best result is a very rough approximation. The biggest issue is that while it does come down to unknown words, sometimes those words are unimportant fluff, and sometimes a single word is the one critical item that is the key to understanding a whole paragraph.

The OP formula isn't a bad try. You would have gotten better granularity using 10^level rather than just the raw HSK level, since the levels are for different windows of word frequency, which decrease exponentially. But the biggest problem is that many words aren't in the HSK so they can't be ranked.
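
To illustrate the 10^level idea, here is a small sketch comparing the plain mean with the exponentially weighted one. The word levels are made up for the example and not checked against the official lists, and scoring non-HSK words as level 7 is carried over from the OP's steps.

```python
from statistics import mean

# Hypothetical word -> HSK level table (illustrative values only).
HSK_LEVEL = {"我": 1, "看": 1, "书": 1, "因为": 2, "逻辑": 5}


def plain_score(words, not_in_hsk=7):
    """Mean of the raw levels, as in the OP's step 4."""
    return mean(HSK_LEVEL.get(w, not_in_hsk) for w in words)


def weighted_score(words, base=10, not_in_hsk=7):
    """Mean of base**level, so each higher level weighs ten times more."""
    return mean(base ** HSK_LEVEL.get(w, not_in_hsk) for w in words)


easy = ["我", "看", "书", "因为"]  # levels 1, 1, 1, 2
hard = ["我", "看", "书", "逻辑"]  # levels 1, 1, 1, 5

print(plain_score(easy), plain_score(hard))        # 1.25 vs 2.0
print(weighted_score(easy), weighted_score(hard))  # 32.5 vs 25007.5
```

The weighted scores spread the two texts much further apart, but words outside the lists then dominate the result completely, which is the ranking problem mentioned above.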

In the US, various algorithms are used to grade reading levels. One common one is Lexiles. According to The Origin of the Lexile Specification Equation, the formula is:

Theoretical Logit = (9.82247*LMSL)-(2.14634*MLWF)

where LMSL = log of the mean sentence length and MLWF = mean of the log word frequencies. LMSL and MLWF are used as proxies for syntactic complexity and semantic demand. (Stenner & Burdick, 1997).

This raw score gets plugged into a normalization formula to get the official Lexile score of 0 to 2000+. So, it's basically a weighted combination of sentence length and word frequency.
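
A sketch of the raw formula quoted above, for experimenting with Chinese text. Several things here are assumptions rather than part of the Lexile specification: natural logarithms, sentence length counted in segmented words, a hypothetical `freq` table of corpus frequencies, and a floor value for words missing from that table. The normalization step is not included.

```python
import math


def theoretical_logit(sentences, freq, floor=1.0):
    """sentences: list of sentences, each a list of segmented words.
    freq: word -> corpus frequency (must be > 0).
    floor: frequency assumed for words missing from `freq` (an assumption)."""
    words = [w for s in sentences for w in s]
    lmsl = math.log(len(words) / len(sentences))  # log of mean sentence length
    mlwf = sum(math.log(freq.get(w, floor)) for w in words) / len(words)
    return 9.82247 * lmsl - 2.14634 * mlwf


sentences = [["我", "看"], ["书", "好"]]
freq = {"我": math.e, "看": math.e, "书": math.e, "好": math.e}
score = theoretical_logit(sentences, freq)
print(score)
```

Longer sentences raise the score and more frequent words lower it, which matches the description of the two terms as proxies for syntactic complexity and semantic demand.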

Flesch–Kincaid is another well-known one. However, it includes number of syllables as a proxy for word difficulty, which may not apply so well to Chinese. Of the methods listed in Wikipedia, most of them have number of syllables or letters in a word as a factor.

I will throw in one final reference, which is from Donald Hayes, who has come up with his own measures. I have yet to find a formula for his "LEX" computed readability. However, a related measure is his "MeanU" number, which is the average word frequency of all words in the text. This is an easy number to calculate, if you are using text analysis software that will list the corpus frequencies for every word.

Jan Finster

I hope it is OK to piggyback this question on this thread. Regarding the difficulty of a given text:

Does the "95% or 97% known words recommendation" refer to "% of all words" or "% unique words" (CTA)?

Luxi

I'm with Roddy on this.

 

On 10/4/2015 at 5:43 PM, roddy said:

Look also at sentence / clause length....

 

... ... etc., etc., etc. If only it was a question of vocabulary! 

Vocabulary is the easy part with all the electronic aids at hand. The rest of it isn't so easy; one only really knows one can read a book when reading it. I have a heap of abandoned reading attempts, waiting for the right time, mood and circumstances. But it's not just unknown words. See here:

 

Quote

她们说。她们以为自己是王。她们嘱我跟她们去看屋子,我去了。我看见屋子,它和它的那些房子朋友们排了一种它们自家高兴排的队,占满整条大街的两边,如一座林。大屋它独个儿凹在一个角落上,别的房子高,它矮;别的房子瘦,它胖;别的房子开朗活泼,它笨,又呆。这,我想起来了,它完全如同我阿果。它正在睡觉,我由得它去睡。天气不冷,但它缩做一团,灰色的外石墙,有如裹了一件厚极了的粗呢外套,加上麻点子的绒毛围巾,以及手套,以及袜子。屋子的楼下有铁闸,由五把锁把守在一起。闸内有大门,门上是弹簧锁。门内的一边是楼梯,每一级上可以让五个我并排挤在一起坐。

 

 

Can you understand this paragraph without reading it several times?

Known words? Probably a high % for intermediate students, 75 - 80%?  It's only the 2nd or 3rd paragraph in Chapter 1 of 《我城》by the Hong Kong writer 西西. She writes like a child, uses simple, common words, no allusions or literary convolutions. But, does she play with the language? 

 

Most e-book sites have free samples that one can read even without opening an account. These can be quite generous, quite enough to check whether the book is up to one's standard (or patience).

 

 

imron
21 hours ago, Jan Finster said:

Does the "95% or 97% known words recommendation" refer to "% of all words" or "% unique words" (CTA)?

All words.  It’s a measure of how much you’ll understand paragraph to paragraph when reading the text. 
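
The difference between the two percentages can be seen in a small sketch. The inputs are hypothetical (a segmented word list and a set of known words); "all words" counts every occurrence, while "unique words" counts each distinct word once.

```python
def coverage(words, known):
    """Return (% of all word occurrences known, % of unique words known)."""
    known_occurrences = sum(1 for w in words if w in known)
    unique = set(words)
    known_unique = len(unique & known)
    return (100 * known_occurrences / len(words),
            100 * known_unique / len(unique))


# Ten running words, four unique words, two of which are known.
words = ["的"] * 6 + ["我", "我", "谈判", "筹码"]
known = {"的", "我"}
all_pct, unique_pct = coverage(words, known)
print(all_pct, unique_pct)  # 80.0 50.0
```

Common words are repeated many times, so the "all words" figure is normally much higher than the "unique words" figure for the same text.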

Jan Finster
7 minutes ago, imron said:

All words.  It’s a measure of how much you’ll understand paragraph to paragraph when reading the text. 

 

Thank you! I could not really find the answer anywhere, but what you say makes a lot of sense.

 

Still, reading a book with 4% unknown words can boil down to 3000 new unique words 🙈

 

Here is one I had in mind (沃顿商学院最受欢迎的谈判课):

    All words (total)    183.829
      Known              176.322
      Percent known      95,92%
      Unknown            7.507
      Percent unknown    4,08%
    Unique words         11.136
      Known              8.087
      Percent known      72,62%
      Unknown            3.049
      Percent unknown    27,38%
 

Piece of cake😅

imron

What happens to unique known when you take out all the unknown words that only appear once?

Jan Finster
1 hour ago, imron said:

What happens to unique known when you take out all the unknown words that only appear once?

 

That is another good point. Leaves me with about 1000 words, and only about 400 of them occur 4 or more times 😊
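
Filtering the unknown words by how often they occur can be sketched like this, assuming a segmented word list and a set of known words (hypothetical inputs, not an export from CTA):

```python
from collections import Counter


def unknown_by_min_count(words, known, min_count):
    """Unknown words that occur at least `min_count` times, with their counts."""
    counts = Counter(w for w in words if w not in known)
    return {w: n for w, n in counts.items() if n >= min_count}


words = ["谈判"] * 4 + ["筹码"] * 2 + ["斡旋"] + ["我"] * 5
known = {"我"}
print(unknown_by_min_count(words, known, 2))  # drops words seen only once
print(unknown_by_min_count(words, known, 4))  # keeps only the frequent ones
```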

imron

Most Chinese text will have a long tail of low-frequency words that won’t have a huge impact on understanding if you only see one of them per paragraph.

 

You can usually safely ignore these words until you are reading a different text that uses those words with more frequency. 

Lu
On 10/4/2015 at 3:40 PM, HerrPetersen said:

An even better idea however would be to just leave the overanalysing alone. Reading the first page of a book and judging its difficulty by just feeling it out is the way I deem most fitting.

Keep this in mind.

 

Using CTA and calculating words and whatnot is all good and well, but be careful that you don't spend more time analysing than just picking up a text, starting to read it, and putting it aside if it proves too difficult. This analysing can be a version of the Textbook Pitfall, where a prospective learner keeps searching for the perfect textbook instead of just starting to study with any reasonably good one.
