Chinese-Forums

What tools does one need to analyze texts? (statistics, software, lol IDK...)


Kelby


Hey folks, I have a question for the more statistically and linguistically minded among you. I'm looking for tools I can use for more statistical analysis of readings. For example, comparing a text to a list of vocabulary or characters to see which words occur in it and at what frequency, so as to assess the amount of unfamiliar vocabulary and characters. Another thing I'd like to do is look at how frequently words or characters occur within the text, in order to build a list of the words most important to that text. This could obviously be done tediously by hand, but I imagine you understand why I'd like to avoid that.

Forgive my ignorance on all of this stuff. I discovered linguistics as a major three years too late and while I enjoyed stats when taking it, the only statistics class my schooling featured came at the end of my college career when my choice was find a job or go to class (seemed like a good idea to skip at the time :P).

I know these things are possible now in the age of internet machines, the Googler, and Windows versions named by number and not year, but I just have no background in any of it.

Anywho, the 'too long didn't read' version is that I'm looking for your go-to tools for analyzing texts in this manner.
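For anyone curious what this looks like in code, here is a minimal Python sketch of the two tasks described above: counting how often each character occurs, and checking a text against a list of known characters. It works on single characters only, since splitting Chinese into words needs a segmenter. The function names and the sample sentence are made up for illustration, and the range check only covers the main CJK Unified Ideographs block.

```python
from collections import Counter


def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def char_frequencies(text):
    """Count how often each Chinese character occurs, skipping punctuation and Latin text."""
    return Counter(ch for ch in text if is_cjk(ch))


def unknown_report(text, known):
    """Return (share of running characters that are known, unknown characters by frequency)."""
    freq = char_frequencies(text)
    total = sum(freq.values())
    known_count = sum(n for ch, n in freq.items() if ch in known)
    unknown = [(ch, n) for ch, n in freq.most_common() if ch not in known]
    coverage = known_count / total if total else 0.0
    return coverage, unknown


# Hypothetical sample sentence and known-character list.
sample = "我喜欢学习中文，我也喜欢看书。"
coverage, unknown = unknown_report(sample, known=set("我喜欢也看书"))
print(f"known coverage: {coverage:.0%}")
print(unknown)
```

The same approach works for words once the text has been segmented: pass a list of words to `Counter` instead of iterating over characters.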


http://www.chinesetextanalyser.com for those who are interested (currently Windows only).

Doesn't have any documentation yet, but if you're reasonably comfortable using computers it shouldn't be too difficult to figure out. Just open a document and it will segment and parse it and provide various statistics.

You can then mark words or entire documents as 'known' (these get tracked across all documents) and export various word lists based on the known/unknown words in a document, sorted by frequency, first appearance and more. The export supports a number of optional fields, including sentence and cloze extraction.
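To illustrate the idea of exporting unknown words in different sort orders, here is a rough Python sketch. It is not how Chinese Text Analyser is implemented; `unknown_words` is a hypothetical helper, and the input is assumed to be already segmented into a list of words.

```python
from collections import Counter


def unknown_words(tokens, known, sort_by="frequency"):
    """List (word, count) for unknown words, ordered by frequency or first appearance."""
    counts = Counter(t for t in tokens if t not in known)
    first_seen = {}
    for i, t in enumerate(tokens):
        first_seen.setdefault(t, i)  # remember only the first position of each word
    if sort_by == "frequency":
        # Highest count first; ties are broken by first appearance.
        def key(w):
            return (-counts[w], first_seen[w])
    elif sort_by == "first_appearance":
        def key(w):
            return first_seen[w]
    else:
        raise ValueError(f"unsupported sort order: {sort_by!r}")
    return [(w, counts[w]) for w in sorted(counts, key=key)]


# Hypothetical pre-segmented text and known-word list.
tokens = ["我", "因为", "图书馆", "安静", "图书馆"]
known = {"我"}
print(unknown_words(tokens, known, "frequency"))
print(unknown_words(tokens, known, "first_appearance"))
```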

It's still missing a number of key features, which is why I haven't made a more public announcement about it yet. But those who are interested are welcome to play with it.


Cool tool.  Love the export functionality.   

 

Possible features I would be interested in:  

  • Global frequency of each term (using some corpus) and its HSK level, so I can see whether a term is common or rare outside the text. Useful for all those frequency-1 terms, some of which are super rare while others are common.
  • Keyboard shortcuts for marking words known/unknown.
  • Inline definitions (so you can use it as a reader).
  • The ability to export the known list.

Otherwise, release when you like, I'll buy!


I'm a little hesitant to add frequency based on external lists because such lists are highly context dependent - if you read mostly newspaper articles you'll have one set of frequent vocab, if you read novels another. The main purpose of the tool is to help you with content that you personally use/encounter regularly and so a generic external frequency list might be misleading as to what is the most relevant for each particular user.

A planned future feature will allow you to scan/parse complete directories and subdirectories, which should let you see accumulated statistics for a large number of files. Anyway, that's not to say I won't add this at some point, but I'll need to think about a good way to integrate it.
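In the meantime, accumulating statistics over a folder is easy to approximate yourself. The sketch below is only an illustration under some assumptions: the files are UTF-8 `.txt` files, counting is per character rather than per word, and `accumulate_counts` is a made-up name, not part of the tool.

```python
from collections import Counter
from pathlib import Path


def accumulate_counts(root, pattern="*.txt"):
    """Sum character counts over every matching file under root, including subdirectories."""
    total = Counter()
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8")  # assumes UTF-8 encoded files
        # Keep only characters in the main CJK Unified Ideographs block.
        total.update(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    return total


# Example usage (the folder name is hypothetical):
# totals = accumulate_counts("my_reading_folder")
# print(totals.most_common(20))
```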

Another feature will be a word list manager that allows you to check the document against common word lists e.g. HSK as well as custom lists.

Regarding definitions, I won't have automatic pop-up definitions because they tend to give readers a false sense of understanding. Instead, the user will have to explicitly request 'look up definition for this word' (either by keyboard shortcut or context menu). Such words will then automatically be marked as unknown for a fixed period of time, because if you have to look up a word, even just to 'check', that indicates you don't know the word well enough for the purpose of reading.
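As a rough illustration of that rule (not the actual implementation), the logic could look something like the Python sketch below. The class name, method names and the 7-day period are all placeholders; the real period hasn't been decided here.

```python
from datetime import datetime, timedelta


class KnownWords:
    """Tracks known words; a looked-up word counts as unknown for a fixed period."""

    def __init__(self, known, penalty=timedelta(days=7)):  # 7 days is an arbitrary placeholder
        self.known = set(known)
        self.penalty = penalty
        self.looked_up = {}  # word -> time of the most recent lookup

    def look_up(self, word, now=None):
        """Record that the user explicitly asked for this word's definition."""
        self.looked_up[word] = now or datetime.now()

    def is_known(self, word, now=None):
        """A word is known only if it is on the list and was not looked up recently."""
        now = now or datetime.now()
        last = self.looked_up.get(word)
        if last is not None and now - last < self.penalty:
            return False
        return word in self.known
```

Passing `now` explicitly makes the behaviour easy to check without waiting a week.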

Re: keyboard shortcuts, I'm a big keyboard user so there will be shortcuts for most features. There aren't any yet for marking words because I also need to figure out a nice way to navigate between words. Arrow keys are the obvious choice, but I'm also going to look for something that doesn't require fingers to leave the home keys.

Exporting of the known word list will be included in the word list manager. In the meantime, you can copy/edit the file manually:

C:\users\<username>\appdata\local\chinesetextanalyser\wordlists\known.txt

It's just a UTF-8 file with one word per line. Note: appdata is a hidden directory, so you need to either make hidden files visible or type the path into the address bar.
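If you'd rather script it, a minimal Python sketch for reading that file and copying the list elsewhere might look like this. The function names are made up, and it assumes the `LOCALAPPDATA` environment variable points at your `appdata\local` folder, which is the usual Windows setup.

```python
import os
from pathlib import Path


def default_known_path():
    """Build the known.txt path under the current user's local appdata folder."""
    # Assumption: LOCALAPPDATA is set, as it normally is on Windows.
    base = os.environ.get("LOCALAPPDATA", "")
    return Path(base) / "chinesetextanalyser" / "wordlists" / "known.txt"


def load_known(path):
    """Read one word per line, ignoring blank lines."""
    # utf-8-sig reads plain UTF-8 and also tolerates a byte order mark if present.
    with open(path, encoding="utf-8-sig") as f:
        return {line.strip() for line in f if line.strip()}


def export_known(words, dest):
    """Write a copy of the list to another file, one word per line."""
    with open(dest, "w", encoding="utf-8") as f:
        for word in sorted(words):
            f.write(word + "\n")


# Example usage:
# known = load_known(default_known_path())
# export_known(known, "known_backup.txt")
```

Writing to a separate file rather than back over known.txt keeps the original untouched.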

Finally, although it's not officially released, if your trial expires before the official release you can still buy a license at any time :-). Licenses will be perpetual and last across versions and OSes.

Let me know if you have any other feedback or suggestions.


 

I'm a little hesitant to add frequency based on external lists because such lists are highly context dependent - if you read mostly newspaper articles you'll have one set of frequent vocab, if you read novels another. The main purpose of the tool is to help you with content that you personally use/encounter regularly and so a generic external frequency list might be misleading as to what is the most relevant for each particular user.

Absolutely right, but when you're in doubt about what to learn, a frequency ranking can help you decide. E.g. as you progress, a text will contain a relatively limited number of frequent unknown words, but a huge number of words that occur only a few times. A general frequency-based list will help rank them.
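A minimal Python sketch of what I mean, assuming you have some general frequency list as a word-to-rank mapping (1 = most common). The function name, the ranks and the sample words are invented for illustration, and the text is assumed to be already segmented.

```python
from collections import Counter


def prioritise_rare(tokens, known, general_rank, max_count=1):
    """Order unknown words that occur at most max_count times by general frequency rank."""
    counts = Counter(t for t in tokens if t not in known)
    rare = [w for w, n in counts.items() if n <= max_count]
    # Words missing from the general list sort last, in order of first appearance.
    return sorted(rare, key=lambda w: general_rank.get(w, float("inf")))


# Hypothetical data: ranks would come from whatever corpus list you trust.
tokens = ["蝙蝠", "因为", "蝙蝠", "安静", "饕餮"]
general_rank = {"因为": 50, "安静": 900, "蝙蝠": 5000}
print(prioritise_rare(tokens, known=set(), general_rank=general_rank))
```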

 

 

A planned future feature will allow you to scan/parse complete directories and subdirectories, which should let you see accumulated statistics for a large number of files. Anyway, that's not to say I won't add this at some point, but I'll need to think about a good way to integrate it.

This would be a very slick feature. If you can rank a directory with texts/books in order of difficulty (% known), that would make finding suitable material a lot easier.
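Roughly what I have in mind, as a Python sketch. It assumes each text is already segmented into a list of words, and `known_ratio` and `rank_by_difficulty` are just illustrative names.

```python
def known_ratio(tokens, known):
    """Share of running words in a text that are on the known list."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in known) / len(tokens)


def rank_by_difficulty(texts, known):
    """Return (name, % known as a fraction) pairs, easiest text first."""
    scored = [(name, known_ratio(tokens, known)) for name, tokens in texts.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical pre-segmented texts.
texts = {
    "story": ["我", "喜欢", "书"],
    "article": ["我", "图书馆", "安静", "因为"],
}
for name, ratio in rank_by_difficulty(texts, known={"我", "喜欢", "书"}):
    print(f"{name}: {ratio:.0%} known")
```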

 

 

Regarding definitions, I won't have automatic pop-up definitions because they tend to give readers a false sense of understanding. Instead, the user will have to explicitly request 'look up definition for this word' (either by keyboard shortcut or context menu). Such words will then automatically be marked as unknown for a fixed period of time, because if you have to look up a word, even just to 'check', that indicates you don't know the word well enough for the purpose of reading.

I understand the reasoning, but it doesn't make me happy: for analysis purposes I define everything in my study deck as known. That way I avoid wasting time adding words to my study deck that are already there. (I accept that this means hugely overestimating what I know.)


but it doesn't make me happy as for analysis I define as known everything in my study deck.

That's where the 'mark exported words as known' feature comes in handy. When you export word lists to a file, you can optionally have the exported words automatically marked as known (on the assumption that you'll be adding these words to some sort of flashcard program for further study). For reference, this was not available in earlier versions but is in the most recent version.
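Conceptually, the behaviour is something like the Python sketch below. This is only an illustration, not the tool's actual code: `export_unknown` and its parameters are hypothetical, and the input is assumed to be a pre-segmented list of words.

```python
def export_unknown(tokens, known, out_path, mark_known=False):
    """Write each unknown word once, in order of first appearance; optionally mark them known."""
    seen = set()
    exported = []
    for t in tokens:
        if t not in known and t not in seen:
            seen.add(t)
            exported.append(t)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("".join(word + "\n" for word in exported))
    if mark_known:
        # Exported words are assumed to be headed for a flashcard deck,
        # so they should not show up as unknown in the next document.
        known.update(exported)
    return exported
```

With `mark_known=True`, running the export on a second document will skip anything already exported from the first.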

