
Creating lists of unknown Hanzi?


18 replies to this topic

#1 share HerrPetersen

HerrPetersen
  • user photo
  • Members
  • 228 posts

Posted 12 February 2009 - 08:07 AM

I have a spreadsheet with all the Hanzi I have learned so far. I have gathered some material (text and vocab) which I want to learn. Is there a way to check every character in the to-be-learned Chinese text against my Hanzi spreadsheet and automatically delete the ones I already know? (This would result in a list of the Hanzi I have not learned so far.)
Any ideas - has this been done before?

Edited by HerrPetersen, 13 February 2009 - 01:49 AM.

  • 0


#2 share HedgePig

HedgePig
  • user photo
  • Members
  • 224 posts
  • Chinese:Permanent Beginner
  • Location:Shanghai

Posted 12 February 2009 - 09:28 AM

HP to HP

I haven't done precisely this but it would be fairly easy if you are happy using VBA.
I would do it as follows:
(1) Read the known hanzi into a collection, using the character as the index
(2) Read through your text and check each character against the collection to see if it exists. (If a lookup against your reference collection returns an error, then it is a new character, if that makes sense.)
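
For readers who would rather not use VBA, the two steps above can be sketched in Python. This is only an illustration, not HedgePig's code: the names `is_hanzi` and `unknown_hanzi` are made up here, and the hanzi test is a rough assumption that only covers the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
def is_hanzi(ch):
    """Rough test (assumption): basic CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"


def unknown_hanzi(text, known):
    """Return hanzi from `text` that are not in `known`, in order of first appearance."""
    known_set = set(known)   # step (1): read the known hanzi into a lookup collection
    seen = set()             # avoids listing the same new character twice
    result = []
    for ch in text:          # step (2): check each character of the text
        if is_hanzi(ch) and ch not in known_set and ch not in seen:
            seen.add(ch)
            result.append(ch)
    return result


known = "这是一个据自"
text = "这不是一个句子, OK?"
print(unknown_hanzi(text, known))  # ['不', '句', '子']
```

A set is used so that each lookup stays fast even with a few thousand known characters; Western letters and punctuation are skipped by the range check.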

If you are not comfortable doing this and want to send me your spreadsheet, I may get a chance to look at it (no promises).

Regards
HedgePig
  • 0

#3 share renzhe

renzhe

    First Episodes Captain

  • user photo
  • Members
  • 4,912 posts
  • Chinese:Reasonable
  • Location:Algarve

Posted 12 February 2009 - 09:29 AM

This is a python script I have lying around for doing exactly that. You'll need a python interpreter, or you'll have to rewrite it in VB or whatever excel uses.

It's not a quick algorithm, and it doesn't filter out duplicates (I use it for filtering character frequency lists, which have no duplicates), but it could be a starting point.

Attached Files


  • 0

#4 share c_redman

c_redman
  • user photo
  • Members
  • 219 posts
  • Location:North Carolina

Posted 12 February 2009 - 11:17 AM

In all the years I've wondered about an Excel function to do this, I just never bothered to look it up or figure it out. I'm just trying it now, and this seems to work:

- Your established reference list is in a column, say A1 through A500
- Your tentative list is in another column, say C1 through C25
- In the cell next to C1 (e.g., B1), enter "=VLOOKUP(C1,A$1:A$500,1,FALSE)". The "$" is important for the next step to work
- highlight the cells B1 through B25, and type Control-D, or Edit->Fill->Down
- You should see "#N/A" for any cell not in the master list
- Select columns B and C (and any others associated), and sort by column B.
- The cells in column C which have "#N/A" next to them are the ones not in the master list; delete the others to leave only the unknown entries
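
The same column-against-column check can be sketched outside Excel. The Python below is a hypothetical equivalent, not something from the thread: `flag_entries` is an invented name, and the two short lists merely stand in for the ranges A1:A500 and C1:C25.

```python
def flag_entries(tentative, reference):
    """Pair each tentative entry with True if it appears in the reference list."""
    reference_set = set(reference)
    return [(entry, entry in reference_set) for entry in tentative]


reference = ["这", "是", "一", "个"]   # stands in for the master list in column A
tentative = ["这", "不", "一", "句"]   # stands in for the tentative list in column C

flags = flag_entries(tentative, reference)
# Entries flagged False correspond to the "#N/A" cells in the spreadsheet.
not_in_reference = [entry for entry, found in flags if not found]
print(not_in_reference)  # ['不', '句']
```

Like VLOOKUP with FALSE as the last argument, this only counts exact matches of the whole entry.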

Note: IANAEW (I am not an Excel wizard)
  • 0

#5 share imron

imron

    Admin

  • user photo
  • Administrators
  • 9,485 posts
  • Location:国外

Posted 12 February 2009 - 11:31 AM

You'll need a python interpreter,

Python can be downloaded here.
  • 0

#6 share HedgePig

HedgePig
  • user photo
  • Members
  • 224 posts
  • Chinese:Permanent Beginner
  • Location:Shanghai

Posted 12 February 2009 - 01:50 PM

I think this spreadsheet should do the trick, but I haven't checked it thoroughly, so no guarantees! Instructions are in the notes tab.

Regards
HedgePig

Edited by HedgePig, 12 February 2009 - 05:39 PM.

  • 0

#7 share HerrPetersen

HerrPetersen
  • user photo
  • Members
  • 228 posts

Posted 12 February 2009 - 06:45 PM

Wow - thanks for the plentiful replies. I will check out HedgePig's spreadsheet once I am at university. (My home computer runs OpenOffice.)
I have very little programming experience (unless I am talking to some pretty good programmers - then it's not "little" but more like "none"), so I will first check out the Excel-sheet before trying my luck with python or VBA.

Edit: @HedgePig - Great job with the file! It works like magic. I also tried to open it with OpenOffice. While there is a button for "Analyse", nothing happens when it is pressed - so unfortunately Microsoft is the way to go here.

Edit2: There is a very minor thing that is strange: I put in a list of roughly 2500 hanzi.
I checked the list of those 2500 hanzi against the list of the 2500 hanzi itself. It did not produce an output of "2500 known hanzi", but rather an output of 2499 known hanzi and 1 unknown hanzi: Now what is so special about 钱?
Not that this takes anything away from the program - it is just a little strange. If interested, here is the file:

Attached Files


Edited by HerrPetersen, 13 February 2009 - 08:57 AM.

  • 0

#8 share HedgePig

HedgePig
  • user photo
  • Members
  • 224 posts
  • Chinese:Permanent Beginner
  • Location:Shanghai

Posted 13 February 2009 - 11:07 AM

Hello H P

What's so special about 钱? Well, as they say, money changes everything :-)

In this case, your reference list includes a space after the 钱 character, so the program is checking to see whether "钱 " is a known character, not "钱".

There is also a space in your "source" list, but this doesn't matter, as any Western characters, punctuation, spaces, etc. are ignored (it's actually a little cruder than this, but it essentially works like this).

I guess I should change the program so that it only picks up the first character in the reference list, or at least pops up a warning or something. I might try that later.
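
The fix described here (use only the first character of each reference entry, or at least warn) might look like the following in Python. This is a sketch under assumptions, not the spreadsheet's actual VBA; `clean_reference` is a hypothetical name.

```python
import warnings


def clean_reference(entries):
    """Strip whitespace from each entry and keep one character per entry."""
    cleaned = set()
    for raw in entries:
        entry = raw.strip()      # removes stray spaces such as the one after 钱
        if not entry:
            continue             # skip empty cells
        if len(entry) > 1:
            # More than one character left: warn and keep only the first one.
            warnings.warn(f"reference entry {raw!r} has extra characters; using {entry[0]!r}")
        cleaned.add(entry[0])
    return cleaned


reference = clean_reference(["钱 ", "这", " 是", ""])
print("钱" in reference)  # True
```

With the stray space stripped, checking the 2500-character list against itself should no longer report 钱 as unknown.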

Regards
HP

P.S. Glad you find it useful.
  • 0

#9 share HerrPetersen

HerrPetersen
  • user photo
  • Members
  • 228 posts

Posted 14 February 2009 - 08:00 AM

Hi H P,
Damn, it was too strange to be just a random bug - deleting the space fixed it. Yea, I like it a lot.
Cheers,
HP
  • 0

#10 share m_k_e

m_k_e
  • user photo
  • Members
  • 16 posts

Posted 08 April 2009 - 04:49 PM

If you're on a *nix system, this may work, too:

me@you:/tmp$ echo "这
> 是
> 一
> 个
> 据
> 自" > known_chars
me@you:/tmp$ echo "这
> 不
> 十
> 一
> 个
> 句
> 资" > new_chars
me@you:/tmp$ grep -v -f known_chars new_chars
不
十
句
资

Or, if you have a ruby installation:
me@you:/tmp$ ruby -e 'f1=IO.readlines("new_chars");f2=IO.readlines("known_chars");puts (f1-f2).join()'
不
十
句
资
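
For anyone without grep or Ruby, roughly the same line-by-line comparison can be sketched in Python. The helper names `read_lines` and `lines_not_in` are invented for this example, and the file names known_chars and new_chars are simply the ones used above. Like the Ruby one-liner, it compares whole lines, whereas grep treats each line of known_chars as a pattern.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def lines_not_in(new_lines, known_lines):
    """Return the lines of `new_lines` that do not occur in `known_lines`."""
    known = set(known_lines)
    return [line for line in new_lines if line not in known]


# Same sample data as in the shell session above, kept in memory for the demo.
known_chars = ["这", "是", "一", "个", "据", "自"]
new_chars = ["这", "不", "十", "一", "个", "句", "资"]
print("\n".join(lines_not_in(new_chars, known_chars)))
```

With real files, the call would be `lines_not_in(read_lines("new_chars"), read_lines("known_chars"))`.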
  • 0

#11 share Ednorog

Ednorog
  • user photo
  • Members
  • 11 posts
  • Location:Sofia, Bulgaria

Posted 26 January 2010 - 03:21 PM

I have a list of all Chinese characters I've studied so far. I would like to find a way to sort out characters that are not on that list. That is, for example, when I copy some text, I would like to be able to see the characters that are new to me, that is, the ones that have no match on my list.

I've been using Wakan for quite some time and it has been doing an excellent job. The problem is that the number of characters on my list has grown to a little over 4000, and Wakan's support for Chinese characters is pretty poor. For example, of the 20 newest characters that I've added to my list, it only recognizes 13, which is quite a poor ratio.

So, does anyone know any software that can help me on that? Any help would be greatly appreciated. :)
  • 0

#12 share jbradfor

jbradfor
  • user photo
  • Members
  • 3,181 posts
  • Chinese:Intermediate
  • Location:WA, USA

Posted 27 January 2010 - 11:16 PM

Look here and here.

Do you really want characters, or do you want words? The first focuses more on characters and will only show you new characters, the second will parse a text into words (with some degree of accuracy....) and, for the new words, get the pinyin and the definition.

Both are linux based.
  • 0

#13 share chrix

chrix

    Admin

  • user photo
  • Members
  • 2,138 posts
  • Location:Texas/Holstein/NTT

Posted 27 January 2010 - 11:41 PM

A couple of questions for clarification:

1. So you know 4,000 characters now? Are you using an SRS like Anki to help you remember this large amount? Or are you reading a lot to maintain your level?

2. If you indeed know 4,000 characters, I would strongly advise concentrating on words, not characters. There are multiple threads in the "General Study" forum on this, this one, for instance.

EDIT: oh, jbradfor beat me to it

Edited by chrix, 28 January 2010 - 12:03 AM.

  • 0

#14 share Ednorog

Ednorog
  • user photo
  • Members
  • 11 posts
  • Location:Sofia, Bulgaria

Posted 28 January 2010 - 12:18 AM

Many thanks to both of you for responding.

Jbradfor, I'm very grateful to you. I was asking for characters, but I think the second one might actually be even more useful - if I get it to work, of course. I'm gonna try those as soon as I get a chance. I don't use Linux, but I suppose it's OK if I just run that code in Python under Windows (I don't know how that sounds; I'm pretty illiterate as far as both coding and Linux go... but I can at least manage installing Python and testing the code).

Chrix, you guessed right, I'm using Anki to study characters/words, and I also do a lot of reading. And yeah, I've realized that there's not much use focusing on characters: with over 4000, I seldom encounter any unfamiliar ones unless I'm deliberately looking for them. It's been four and a half years since I started studying Chinese, but over the last year or so I've added probably less than 15% of those 4k, so the curve is a lot flatter now...

My intention is to use this kind of software mostly for statistical purposes: for example, if I open a web page, a short story or a novel, I'd be able to find out how many unknown characters there are in it.
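
That kind of count can be sketched in a few lines of Python. The function name `unknown_character_stats` and the keys of the returned dictionary are made up for this illustration, and the hanzi test is again the rough assumption of the basic CJK block (U+4E00 to U+9FFF).

```python
from collections import Counter


def unknown_character_stats(text, known):
    """Count hanzi in `text` and report how many are missing from `known`."""
    known_set = set(known)
    # Count every hanzi in the text (rough range check, see the note above).
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    unknown = {ch: n for ch, n in counts.items() if ch not in known_set}
    total = sum(counts.values())
    unknown_total = sum(unknown.values())
    return {
        "distinct_hanzi": len(counts),
        "distinct_unknown": len(unknown),
        "unknown_share": unknown_total / total if total else 0.0,
        "unknown_counts": Counter(unknown).most_common(),
    }


stats = unknown_character_stats("钱钱钱不是问题", "不是")
print(stats["distinct_unknown"])   # 3
print(stats["unknown_counts"][0])  # ('钱', 3)
```

The per-character counts show which unknown characters are worth learning first, since a character that appears often in the text accounts for more of the unknown share.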
  • 0

#15 share jbradfor

jbradfor
  • user photo
  • Members
  • 3,181 posts
  • Chinese:Intermediate
  • Location:WA, USA

Posted 28 January 2010 - 12:51 AM

@Ednorog, the second one is horribly kludged together, mixing python and gawk -- and I'm allowed to say that, as I built it.

If you really like it, with a bit of work I could move it all into python, if that helps. I've gotten no requests yet, so it's not done.
  • 0

#16 share HerrPetersen

HerrPetersen
  • user photo
  • Members
  • 228 posts

Posted 28 January 2010 - 04:32 PM

Did you check out this page?
http://www.chinese-f...ead.php?t=28000
  • 0

#17 share chrix

chrix

    Admin

  • user photo
  • Members
  • 2,138 posts
  • Location:Texas/Holstein/NTT

Posted 28 January 2010 - 06:37 PM

Merged.
  • 0

#18 share Ednorog

Ednorog
  • user photo
  • Members
  • 11 posts
  • Location:Sofia, Bulgaria

Posted 29 January 2010 - 12:10 AM

Ok, that .xls file by HedgePig was everything I asked plus more (the number of instances of each character in the sample was a huuuge bonus)! :)
So, thanks a million, to both of you, HP's! :)

As far as the python scripts are concerned, I've gotten totally nowhere so far and I've pretty much given up, since I apparently need to know a lot more about coding than I presently do (which is not very far from zero, actually).
  • 0

#19 share phyrex

phyrex
  • user photo
  • Members
  • 199 posts

Posted 01 April 2010 - 03:00 AM

Seems I made something similar. See here: http://www.chinese-f...ad.php?p=228844
  • 0

