Chinese-Forums

锵锵三人行 'corpus'!


realmayo


Whoops! I obtained the numbers from - of course - Chinese Text Analyser, but I forgot that I have it using a smaller dictionary now. Sorry for this oversight.

 

Edit: I just tried it with a fresh CEDICT and the numbers are the same as the above. I'll let somebody else sort it out.


Actually what would be interesting (relative to the values of interesting with which we're working here) would be to compare that to other corpuses (corpii?) and see where the differences are. I note 美女 ranks in the top 1,000, with 客人 and 观众 lagging further behind. 


Seeing stats like this, I think it really helps drive home the point I have been making in other threads that once you reach a certain level, vocabulary becomes dramatically less important and provides only marginal increases in understanding.

 

That level is also significantly lower than most people believe.


"once you reach a certain level, vocabulary becomes dramatically less important and provides only marginal increases in understanding"

 

And conversely, once you reach a certain level, every little step higher in understanding requires an enormous amount of exposure!

 

I find it a bit depressing: I could read for hours every day but after a year if I pick up a new book at random, my comprehension rate -- in theory -- will be basically the same as it would have been 12 months earlier.

 

In practical terms though this shows that once you're at that certain level, there's greater impact going over previously-learned but perhaps-forgotten, or only half-remembered, higher frequency words, to make sure you have them rock solid, than in learning new ones.


"I find it a bit depressing: I could read for hours every day but after a year if I pick up a new book at random, my comprehension rate -- in theory -- will be basically the same as it would have been 12 months earlier."

Probably not though.

 

Take, for example, the corpus you put together: 338,000 characters. At the relatively slow reading speed of 150 cpm, it would take you about 40 hours to read through the entire transcript. An hour a day would see you get through the above wordlist in a little over a month.
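For anyone who wants to plug in their own reading speed, the arithmetic can be sketched in a few lines of Python. This is only a rough illustration using the 338,000-character and 150 cpm figures from this post; the function name `reading_time_hours` is made up for the example.

```python
def reading_time_hours(total_chars, chars_per_minute):
    """Hours needed to read total_chars at a steady chars_per_minute."""
    return total_chars / chars_per_minute / 60


# Figures quoted in the post: a 338,000-character corpus read at 150 cpm.
hours = reading_time_hours(338_000, 150)

# At one hour of reading per day, the number of days equals the number of hours.
print(f"{hours:.1f} hours, i.e. about {hours:.0f} days at an hour a day")
```

Changing the second argument shows how quickly the total drops for a faster reader.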

 

After a year of doing that, the numbers will be a bit higher for quite a few of those single-frequency items. Maybe you'd see them 2-3 times in the entire year, which, translated into SRS terms, is a revision every 4-6 months. That's probably not so bad if you spend some time learning the word properly the first time.

"In practical terms though this shows that once you're at that certain level, there's greater impact going over previously-learned but perhaps-forgotten, or only half-remembered, higher frequency words, to make sure you have them rock solid, than in learning new ones."

I think this is very true.  It also highlights the need to make sure you are learning words from material that you are actually encountering, otherwise there's a high chance that you're learning words you'll never see/use.


Another interesting tidbit (which I got after getting the original Word document from realmayo and playing around with it myself in CTA): the magical 98% comprehension threshold kicks in at almost exactly the spot where the single-frequency items start (fewer than 10 words' difference).

 

So it's not so dire as post #12 makes out.  You could never learn the single frequency items, and still have enough comprehension to learn new words almost entirely from context.
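To make the 98% point concrete, here is a minimal sketch of how coverage can be worked out from a word-frequency list. It is not how CTA does it internally (I don't know that); the function names and the toy counts are invented purely for illustration, and real numbers would come from the actual transcript wordlist.

```python
from collections import Counter


def coverage_without_singletons(counts):
    """Share of running words covered if every word seen only once is unknown."""
    total = sum(counts.values())
    singles = sum(1 for c in counts.values() if c == 1)
    return (total - singles) / total


def words_needed(counts, target=0.98):
    """How many of the most frequent words are needed to reach target coverage."""
    total = sum(counts.values())
    running = 0
    for rank, (_, c) in enumerate(counts.most_common(), start=1):
        running += c
        if running / total >= target:
            return rank
    return len(counts)


# Toy, made-up counts (NOT taken from the real corpus).
toy = Counter({"的": 60, "是": 30, "我": 8, "美女": 1, "观众": 1})
print(coverage_without_singletons(toy))
print(words_needed(toy, target=0.95))
```

With a real wordlist, comparing `words_needed(counts)` against the rank where the count-of-one items begin would show how close the two points are.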


20-25 minutes.

 

So for 100 shows, it's about 3383 total characters per show and about 135-170 cpm.

 

Word-wise, it's about 2334 total words per show, or about 93-117 wpm.

 

There will be some margin for error, as it doesn't take into account pauses, cuts to ads, title sequences and so forth.
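The per-minute figures can be reproduced roughly with the sketch below. It only assumes the numbers given in this post (3383 characters and 2334 words per show, 20-25 minutes of talk per show); the helper name `rate_range` is hypothetical.

```python
def rate_range(units_per_show, min_minutes, max_minutes):
    """Return (slowest, fastest) units per minute for a show of uncertain length."""
    return units_per_show / max_minutes, units_per_show / min_minutes


# Per-show totals and the 20-25 minute running time quoted in the post.
cpm_low, cpm_high = rate_range(3383, 20, 25)
wpm_low, wpm_high = rate_range(2334, 20, 25)

print(f"characters per minute: {cpm_low:.0f}-{cpm_high:.0f}")
print(f"words per minute: {wpm_low:.0f}-{wpm_high:.0f}")
```

Since ads, pauses and title sequences aren't subtracted, the true speaking rate would be somewhat higher than either end of these ranges.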


I know we all like the show, but the few transcripts I worked on were quite buggy, and a teacher I showed them to suggested they might have been generated by machine (doubtful) or at least quickly and cheaply (more likely). If you're going to put effort into such a project, I suggest something with more reliable transcripts. Unfortunately none spring to mind.
