Idea for finding Chinese texts at the appropriate level

February 11, 2014 at 06:34 PM

I hope this hasn't been covered elsewhere already. I use Skritter to learn characters (supplemented with handwriting and Quizlet). Skritter allows you to export your list of characters and words, and I believe their API also lets you do this. I was thinking it would be cool to have an app or program that takes in your known words and scans texts to find ones where you already know a certain percentage of the characters, so you can read texts with the right amount of "flow". I know I could write this software pretty quickly, but the biggest problem is finding a source of Chinese writings whose owners don't mind if I scrape their site and put the texts in some sort of database.
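A minimal sketch of the "percentage of known characters" idea, assuming the known characters have already been exported into a Python set. The function names and the 90-98% default window are illustrative assumptions, not anything Skritter provides.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def char_coverage(text: str, known_chars: set[str]) -> float:
    """Fraction of the Chinese characters in `text` that are in `known_chars`."""
    chars = [ch for ch in text if is_cjk(ch)]  # ignore punctuation, digits, Latin
    if not chars:
        return 0.0
    return sum(ch in known_chars for ch in chars) / len(chars)


def texts_in_range(texts: dict[str, str], known_chars: set[str],
                   low: float = 0.90, high: float = 0.98) -> list[tuple[str, float]]:
    """Return (title, coverage) pairs with low <= coverage <= high, easiest first.

    The default window is an assumption; pick whatever feels like "flow".
    """
    scored = [(title, char_coverage(body, known_chars)) for title, body in texts.items()]
    in_range = [(title, cov) for title, cov in scored if low <= cov <= high]
    return sorted(in_range, key=lambda pair: -pair[1])
```

Each text is scanned once, character by character, so the cost grows linearly with the size of the collection.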

I was thinking it might be fun, especially if I could search a large range of texts. Of course now that I think about it, the program might take forever to run, as it would have to scan a ton of characters.

Another way to make it run faster (I'm thinking "aloud" here) would be to let the user input a paragraph or two of text. I could generate a vocabulary list that you should study in order to be able to read the paragraph more smoothly. I'm curious if anyone thinks this is interesting or if there is already some sort of similar program available. Of course I could go grab a paragraph from cn.nytimes.com, figure out all the characters I don't know, and then learn them, but I was thinking something more automated would be nicer and take less time.
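The paragraph-based variant could look roughly like this: a sketch that lists the unknown characters in a pasted paragraph, most frequent first. It works on single characters only; `study_list` is a made-up name, and handling multi-character words would need a segmenter.

```python
from collections import Counter


def study_list(paragraph: str, known_chars: set[str]) -> list[tuple[str, int]]:
    """Unknown Chinese characters in `paragraph`, with counts, most frequent first."""
    unknown = Counter(
        ch for ch in paragraph
        if "\u4e00" <= ch <= "\u9fff" and ch not in known_chars
    )
    return unknown.most_common()
```

Sorting by count means the characters that block the most reading come first in the list.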

-Rachel

February 12, 2014 at 12:53 AM

I've been thinking about something similar, but my programming skills are extremely basic and it would take me forever to create such an app the way I imagine it should be.

I was thinking it might be fun, especially if I could search a large range of texts. Of course now that I think about it, the program might take forever to run, as it would have to scan a ton of characters.

This is true for the large majority of segmentation apps/websites. I recently had a look at an app Imron wrote which is fast and smooth even for large texts. See: http://www.chinesetextanalyser.com/

If you build a database of articles, speed wouldn't be the biggest issue, as you could help it a lot by creating the vocabulary/frequency lists beforehand. The only thing left to do would be to compare the article's vocabulary list with the known vocabulary. Names, however, can be a pain when generating vocabulary lists.
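A rough sketch of that precomputation, assuming each article has already been segmented into a list of words (segmentation itself is not shown, and the function names are hypothetical). The per-article frequency tables are built once; matching a user then only touches the distinct words of each article.

```python
from collections import Counter


def build_index(articles: dict[str, list[str]]) -> dict[str, Counter]:
    """Precompute a word-frequency table per article (done once, offline)."""
    return {title: Counter(words) for title, words in articles.items()}


def known_ratio(freqs: Counter, known_words: set[str]) -> float:
    """Share of running words in an article that the user already knows."""
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    known = sum(count for word, count in freqs.items() if word in known_words)
    return known / total
```

For example, an article indexed as 我 ×1, 喜欢 ×2, 苹果 ×1 gives a ratio of 0.75 for a user who knows 我 and 喜欢.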

Entering only a few paragraphs is not the solution, as vocabulary may vary a lot from paragraph to paragraph. E.g. a book about a traveler may start with a family scene, followed by a scene about taking a flight and passing customs, then an extensive description of the fabulous scenery, a description of car trouble, a description of a horrible hotel/restaurant, and of course conversations with people the traveler meets about the most elaborate and weird subjects. Each paragraph could have some quite specific vocabulary.

I'm no expert on the matter, but I think there can't be any (legal) objection to scraping websites for text as long as you don't publish their articles. You might, for example, publish only the metadata/recommendations with a link to the article and possibly a quote. That way it would legally be no different from any other search engine.

There are already websites with functionality that goes in this direction; however, I've not come across any that really has the features I've dreamed up.

February 12, 2014 at 06:02 AM

If you're interested in this sort of thing there was a thread just a couple days ago about a guy doing roughly the same thing but from a different angle (collecting texts around your level) here: http://www.chinese-forums.com/index.php?/topic/43681-chineselevelcom-intro/#comment-326359

Might be worth looking at, whether you're an interested user or developer.

February 12, 2014 at 06:55 AM

I recently had a look at an app Imron wrote which is fast and smooth even for large texts. See: http://www.chinesetextanalyser.com/

People are welcome to check this out, but I should note that this is very much still in early development and is lacking a number of features as well as any documentation (I'll be making a longer introduction post once it's more ready for use). Silent is right though in that the one thing it does do well at the moment is handle large texts - even GBs of text is no problem.

February 12, 2014 at 08:17 AM

Though they are not exactly what you're looking for, because they all start from a different point of view, you may be interested in programmes like:

- "Chinese Word Extractor". Compared with your idea, it works the other way around: you can analyse a text to extract the vocabulary, but you can also filter the analysis using stop lists of known words. So, in theory, the number of words left gives you an idea of the difficulty of the text. If I'm not mistaken, the author of the software is a member of this forum, so he may chime in and give better explanations.

- "Learning with Texts". It analyses a text and, using a database of words as a backend, it shows in colours which words are known, unknown or in between. So, again, you can have an idea of the theoretical/purely lexical difficulty of a text. LWT also contains an SRS, an audio tool, etc. I've used it a lot myself, and I believe it's very useful. See this topic.

- Lingq. I've never used it, so I can't comment, but I think, basically, it was the inspiration for LWT.
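To make the stop-list approach concrete, here is a small sketch. It is not how Chinese Word Extractor or LWT actually work internally; it assumes a plain set of dictionary words and uses greedy forward longest-match segmentation, which is only one simple way to split Chinese text into words. All names are illustrative.

```python
def segment(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Greedy forward longest-match segmentation.

    At each position, take the longest dictionary word (up to `max_len`
    characters); anything unmatched becomes a single-character token.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def remaining_words(text: str, dictionary: set[str], stop_list: set[str]) -> list[str]:
    """Distinct dictionary words in `text` that are not on the stop list,
    in order of first appearance."""
    seen = []
    for word in segment(text, dictionary):
        if word in dictionary and word not in stop_list and word not in seen:
            seen.append(word)
    return seen
```

The length of the returned list is the "number of words left" mentioned above. Greedy matching can mis-segment ambiguous strings, so treat the count as a rough indicator.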

February 12, 2014 at 11:57 AM

The idea in the first paragraph of the OP is great: given *my* known word list and a preferred "% known", find me some graded reading (or report the % known of a given file or web page). Someone should write it immediately. :-)

It would be nice if it could handle written Cantonese too, just by changing the dictionary?

February 12, 2014 at 01:19 PM

- "Chinese Word Extractor". Compared with your idea, it works the other way around: you can analyse a text to extract the vocabulary, but you can also filter the analysis using stop lists of known words. So, in theory, the number of words left gives you an idea of the difficulty of the text. If I'm not mistaken, the author of the software is a member of this forum, so he may chime in and give better explanations.

Yes, I'm here! That program would work well with that approach. However, what I personally find is that it's really tedious to curate a list of every word you know.

Another way of looking at the problem, which is used quite a bit in education, is scoring the text for readability based on word frequency. In English there are the Lexile and other readability frameworks (e.g., see here for a comparison), which are a main component of public education in the US. The late Dr. Hayes of Cornell developed some general scoring methods such as LEX and QLEX. On my earlier online version of the word extraction tool, I calculate the value for meanU, which is just the average frequency of the words in the text. It's not a great predictor, other than differentiating between very hard and very easy texts.
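As an illustration only, an "average frequency of the words in the text" score could be computed along these lines. This is a guess at the general shape, not the actual meanU code; the frequency table format and the choice to skip words missing from the table are assumptions.

```python
def mean_frequency(words: list[str], freq_table: dict[str, float]) -> float:
    """Average corpus frequency of the words in a text.

    `freq_table` maps a word to its corpus frequency (e.g. occurrences per
    million). Words missing from the table are skipped, which is an
    assumption; they could also be counted as frequency 0.
    """
    values = [freq_table[word] for word in words if word in freq_table]
    return sum(values) / len(values) if values else 0.0
```

A plain mean is pulled strongly by very common words such as 的, which may be one reason a score like this mainly separates very easy texts from very hard ones.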

Adding readability scores to the CWE program is something I hadn't thought of before, but it would be a great idea to add it.

February 12, 2014 at 02:49 PM

I calculate the value for meanU, which is just the average frequency of the words in the text.

Out of curiosity, do you mean that the programme (1) extracts each word from the text; (2) checks the frequency of each word in a corpus/ an established list of frequencies for Chinese words; (3) calculates a global index for the text? There was a time when I used that meanU thing to choose texts. Why was it available only in the online version?

it's really tedious to curate a list of every word you know.

Yes, however I've used your programme with ready-made stop lists consisting of HSK5 or HSK 6 words. It gives a very rough indication of the difficulty of a given text.

February 12, 2014 at 04:32 PM

These are excellent suggestions. I'm going to do more research into this and try to hack something up.

I'm a developer by trade and have always been interested in computational linguistics, although I'm a bit rusty on the linguistics side.

I would write more but I'm typing on my iPad and it's getting annoying to type...

February 12, 2014 at 06:43 PM

Yes, I'm here! That program would work well with that approach. However, what I personally find is that it's really tedious to curate a list of every word you know.

To me, analysing many articles/books is more tedious than keeping track of the known vocabulary. The vocabulary is just an export of your SRS data that you update every once in a while... Of course, to make that workable you need to add all your vocabulary to your SRS system :) To me that works great, as I started 'serious' study by learning characters through Anki and I don't drop the known vocabulary as some here like to do. If known vocabulary shows up at the top of the frequency list of a text analysis, I still add it to Anki.

Rating texts according to some general rating can be very handy, but it is also a little bit tricky, as many people are not average. After learning to a certain level, people tend to start reading what interests them and their reading material moves away from the average. Nonetheless, it might be very interesting metadata, especially if the rating is calculated over the unknown vocabulary. I expect the combination of % of unknown vocabulary, a general difficulty ranking of the unknown vocabulary, and subject/title would make a good personalized choice easy. Perfect would be a Google (Baidu or whatever favorite search engine) search extended with the indicators of % known and difficulty rating of the unknown vocabulary :)
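One way such personalized metadata might be computed, as a sketch: percent known over the running words, plus a difficulty figure taken as the mean frequency rank of the distinct unknown words. The `ArticleRating` record, the rank table, and the default rank for unlisted words are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class ArticleRating:
    title: str
    percent_known: float       # share of running words the user knows, 0-100
    unknown_difficulty: float  # mean frequency rank of unknown words; higher = rarer


def rate_article(title: str, words: list[str], known_words: set[str],
                 freq_rank: dict[str, int], default_rank: int = 100_000) -> ArticleRating:
    """Rate one pre-segmented article against a user's known-word list.

    `freq_rank` maps a word to its rank in a frequency list (1 = most common).
    Words absent from the list get `default_rank`, an arbitrary assumption.
    """
    if not words:
        return ArticleRating(title, 0.0, 0.0)
    unknown = [word for word in words if word not in known_words]
    percent_known = 100.0 * (len(words) - len(unknown)) / len(words)
    distinct = set(unknown)
    if distinct:
        difficulty = sum(freq_rank.get(word, default_rank) for word in distinct) / len(distinct)
    else:
        difficulty = 0.0
    return ArticleRating(title, percent_known, difficulty)
```

Both numbers could then be shown next to the title and link in a search result, leaving the final choice of subject to the reader.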

February 13, 2014 at 10:42 AM

As stated above, I think starting with HSK 4, 5 or 6 as a baseline (and/or adding a user's submitted SRS list) would work well as a start.

Posters, in thread order: 新墨西哥人, Silent, Kelby, imron, laurenth, querido, c_redman, laurenth, 新墨西哥人, Silent, icebear