Chinese-Forums

Make Me a Hanzi: free, open-source Chinese character data


skishore


Looks fantastic! I don't understand the programming side of things, but I do understand free and open source :)  And I would be very keen on a free character-learning app :)

 

One of the things about Chinese is the need to spend a lot of time practising writing and learning characters, and anything that can contribute to this has got to be good.


Thanks for the kind words, everyone!

 

@wibr Wow, I didn't know about that Skritter repository. I spent a while looking through it. To make use of it directly, I'd have to reverse-engineer their network protocol as all their data comes from the server. Maybe I can just pull out the character learning library, though.

 

@kawakusong Sure, I can explain! The character recognition uses the "medians" field from the graphics data file. I wrote some code to compress that data down to 1 MB. When the demo site is first loaded, it loads and decompresses the medians data, then uses it for matching. After the data is loaded, matching is entirely client-side. The matching algorithm is extremely simple at the moment - it expects input strokes to be in the correct stroke order and compares them against each candidate character's medians by angle and position.
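For readers who want a concrete picture of "compare by angle and position", here is a minimal Python sketch of that idea. It is not the demo site's actual code: the function names, the cost weighting, and the assumption that each median is a list of (x, y) points in a shared coordinate box are all illustrative guesses.

```python
import math

def stroke_features(median):
    """Summarize a stroke's median (list of (x, y) points) as (angle, midpoint)."""
    (x0, y0), (x1, y1) = median[0], median[-1]
    angle = math.atan2(y1 - y0, x1 - x0)        # overall direction, start to end
    midpoint = ((x0 + x1) / 2, (y0 + y1) / 2)   # rough position of the stroke
    return angle, midpoint

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def stroke_cost(drawn, ref, size=1.0):
    """Hypothetical cost: normalized angle difference plus normalized distance."""
    a1, m1 = stroke_features(drawn)
    a2, m2 = stroke_features(ref)
    angle_term = angle_diff(a1, a2) / math.pi
    position_term = math.dist(m1, m2) / (size * math.sqrt(2))
    return angle_term + position_term

def match(drawn_strokes, candidates, size=1.0):
    """Return the candidate whose medians best match, or None if none fit.

    Assumes correct stroke order, so drawn stroke i is compared only with
    reference stroke i, and candidates with a different stroke count are skipped.
    """
    best, best_cost = None, math.inf
    for char, medians in candidates.items():
        if len(medians) != len(drawn_strokes):
            continue
        cost = sum(stroke_cost(d, r, size) for d, r in zip(drawn_strokes, medians))
        if cost < best_cost:
            best, best_cost = char, cost
    return best
```

Calling `match(drawn, candidates)` with `candidates` mapping each character to its list of medians returns the lowest-cost character. Because strokes are paired by index, a correct shape drawn in the wrong order will score badly, which is the limitation described above.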

 

@XiaoKui For this demo site, I want the handwriting recognition algorithm to be more liberal. It's pretty basic at the moment and I would definitely like to improve it. One major problem with the recognition that I already know of is that for components like 女, 艹, 骨, it only recognizes the variant used by the Arphic font in each character. If you have examples where it failed to recognize your writing, it would be great if you could share some screenshots!

 

@boctulus No, it's purely a hobby project!

 

It sounds like getting Skritter's client or some similar app working is a good next step. I'll work towards that.


@skishore:  you found "88% characters have the phonetic component on the right"  (very interesting!)

 

I'd like to know something about the complexity (in terms of number of strokes) of the phonetic component: is there any relationship? (Does it have fewer strokes than the semantic component, on average?)


Hate to be a wet blanket here, but a legal concern: isn't the Arphic Public License for the fonts non-commercial? (this is why we haven't tried to do something similar at Pleco to supply stroke order in 楷体 based on those fonts - don't want to get sued) Wouldn't seem to impact your project but might affect others making derivatives of it.


That's right, the fonts are distributed with Ubuntu and are licensed under the older, free-software APL. Perhaps I should build another round of the stroke order data with the new fonts too, though - depending on the details, the non-commercial aspect may not be a problem for me.

 

@boctulus Interesting question! Now, I can guess the result ahead of time (think of all the common radicals like 亻, 讠, 辶, 艹, 女, etc), but it's good to confirm:

  • Out of 6602 examples in the dataset with a pictophonetic etymology, in 76% the phonetic component had more strokes, and in 16% it had fewer.
  • The average number of strokes in the phonetic component was 7.8, while in the semantic component it was 4.5.

This analysis is not that great. A better one would take character frequency into account and also compute simplified and traditional statistics separately. The numbers for traditional characters would likely be a lot closer. I'm leaving that to you all for now, though! The quick script I wrote to get these numbers is here.
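For anyone who wants to try the comparison themselves, here is a rough Python sketch of this kind of count. It is not the script linked above. It assumes each dictionary entry is a dict with an optional "etymology" field holding "type", "phonetic", and "semantic" keys, and that `stroke_counts` is a separately built mapping from component to stroke count; both are assumptions about the data layout, and the function name is made up.

```python
def compare_components(entries, stroke_counts):
    """Compare stroke counts of phonetic vs. semantic components.

    entries: iterable of dicts, each with an optional "etymology" dict
             (assumed keys: "type", "phonetic", "semantic").
    stroke_counts: dict mapping a component to its number of strokes.
    Returns a summary dict, or None if no usable examples were found.
    """
    total = more = fewer = 0
    phonetic_sum = semantic_sum = 0
    for entry in entries:
        etymology = entry.get("etymology") or {}
        if etymology.get("type") != "pictophonetic":
            continue
        phonetic = etymology.get("phonetic")
        semantic = etymology.get("semantic")
        # Skip examples where either component has no known stroke count.
        if phonetic not in stroke_counts or semantic not in stroke_counts:
            continue
        p, s = stroke_counts[phonetic], stroke_counts[semantic]
        total += 1
        more += p > s
        fewer += p < s
        phonetic_sum += p
        semantic_sum += s
    if total == 0:
        return None
    return {
        "total": total,
        "pct_phonetic_more": 100 * more / total,
        "pct_phonetic_fewer": 100 * fewer / total,
        "pct_equal": 100 * (total - more - fewer) / total,
        "avg_phonetic": phonetic_sum / total,
        "avg_semantic": semantic_sum / total,
    }
```

Weighting each example by character frequency, or running the function separately on simplified and traditional subsets, would be small extensions of the same loop.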

