Chinese-Forums

HSK wordlist softcopy?


m.ellison


Assuming the ones scattered around the internet aren't 'legal', no. You get CD-ROMs attached to some HSK books, but I'd doubt they have the list in an easily accessible format. And assuming 'legally' means from the copyright holder, or licensee, I think it's very unlikely they would make the list available in such an easily copyable format.

Roddy


I've actually wondered about that, though. A plain list, sure. But these lists are only valuable because of the information they contain: the grading into four groups, and maybe also part-of-speech info and English glosses (although what there is of that in my database came from ADSO). So it's not a simple list, it's a graded list. Where does that stand?

Roddy


Law varies country by country. The European Union, Canada and many other countries offer significant protection for sweat-of-the-brow compilations, which means that you can get in serious trouble for copying compilations of materials even if individual entries are not copyrightable in and of themselves.

American law is more permissive, holding that facts are not copyrightable even in compiled form, although the litmus test involved a phone book. American law does provide protection for creative compilations. English-language dictionary definitions are copyrightable, and there is clearly a higher element of creativity involved when one starts straddling languages and explaining foreign words and concepts. Assuming you live in the United States, you're probably safe using lists of unambiguous proper nouns, but would be on difficult ground for everything else.

We're trying to be conservative with Adso for exactly these reasons. The Linguistic Data Consortium normally charges an arm and a leg for their stuff and makes it available only to private subscribers at significant cost, but even they released a list containing some CEDICT material free of charge under the CEDICT licence. This suggests that CEDICT will hold up in court. And if it does, any other bilingual wordlist is almost certainly protected as well. The LDC also has other, more elaborate lists that it makes available for thousands of dollars. If it were legal to dump those lists and redistribute them, you'd be able to find them on the Internet. The fact that you can't is probably answer enough to your question.

So be careful. You're probably best off doing this sort of thing from scratch or joining one of the projects already doing this.


I've seen lots of lists for the Japanese JLPT (Japanese Language Proficiency Test) available for download in many places.

You can get graded lists of Chinese characters from Wenlin software, ordered by frequency, with readings, pronunciations and examples in two forms (traditional and simplified). E.g., just select the first 1,000 and study.

IMO, those HSK lists should be available for free download.

The first 20 most frequent characters from Wenlin:

1 的 [de] (grammatical particle) [dì] 目的 mùdì goal [dí] 的确 [dī] cab

2 一 [yī] one; 一定 certain; 一样 same; 一些 some

3 是 [shì] to be

4 不 [bù] not [bú]

5 了 [le] (particle) [liǎo] 了解 comprehend [liào] (=瞭) [liāo] [liáo]

6 人 [rén] person; 人类 rénlèi humankind; 有人吗? anybody here?

7 在 [zài] at; 现在 xiànzài now; 存在 cúnzài exist

8 我 [wǒ] I, me; 我们 wǒmen we

9 有 [yǒu] have; there is; 没有 haven't; 有的 some [yòu] (=又)

10 中 [zhōng] middle; in; 中国 Zhōngguó China [zhòng] hit (a target)

11 这(F這) [zhè] [zhèi] this

12 大 [dà] big; 大家 dàjiā everybody [dài] 大夫 dàifu doctor

13 国(F國) [guó] (国家) country; 中国 China; 美国 USA

14 上 [shàng] over; top; (go) up; last, previous [shǎng] 上声 [shang]

15 个(F個) [gè] [ge] (measure word); 个人 personal [gě] 自个

16 来(F來) [lái] come; 起来 get up; 原来 it turns out [lai]

17 他 [tā] he, him; she, her; it; (其他 qítā) other

18 为(F為) [wèi] for, on account of [wéi] be, become

19 到 [dào] to, towards, until

20 地 [dì] earth [de] -ly (adverbial particle)

...
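If you want to work with a list like this outside Wenlin, here is a rough sketch of how the lines above could be parsed and cut down to the first N entries. It only assumes the layout shown in this post (rank, character, optional `(F…)` traditional form, readings in square brackets); `parse_line` and `first_n` are made-up helper names, not part of Wenlin, and treating a missing `(F…)` as "traditional is the same as simplified" is my assumption.

```python
import re

# Assumed layout, taken from the sample lines in this post:
#   <rank> <char>[(F<traditional>)] [reading] ... gloss text
LINE_RE = re.compile(r"^(\d+)\s+(\S)(?:\(F(\S)\))?\s+(.*)$")


def parse_line(line):
    """Parse one list line into a dict, or return None if it doesn't fit."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rank, simp, trad, rest = m.groups()
    return {
        "rank": int(rank),
        "simplified": simp,
        # Assumption: no (F...) marker means both forms are identical.
        "traditional": trad or simp,
        # Everything inside square brackets is treated as a reading.
        "readings": re.findall(r"\[([^\]]+)\]", rest),
    }


def first_n(lines, n):
    """Return the parsed entries with rank <= n, ordered by rank."""
    parsed = [p for p in map(parse_line, lines) if p is not None]
    return [p for p in sorted(parsed, key=lambda p: p["rank"]) if p["rank"] <= n]


sample = [
    "11 这(F這) [zhè] [zhèi] this",
    "3 是 [shì] to be",
    "...",
]
top = first_n(sample, 10)
```

With the sample above, `top` holds only the entry for 是, since 这 has rank 11 and the `...` line doesn't parse.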


The thieves! Wait till I get my lawyer!!!!!!

I don't think they did get that from here, and if they did they've added a ton of further work on top. Not that it matters anyway, it's not like I am doing anything with them.

I'm not sure about how they separate characters / words for the HSK lists though - they seem to do it just by putting all one-character words under 'characters', and everything else under 'words', which is a bit misleading. 马 and 用 are words in their own right, but you won't find them in the word list, while 初 appears on the word list in 最初, but not in the character list.
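To make the problem concrete, here's a minimal sketch of what that kind of split does, assuming it really is just done by entry length (which is my guess, not something I've confirmed). The sample entries are purely illustrative, not real HSK data, and the function names are made up.

```python
def split_by_length(entries):
    """Naive split: one-character entries are 'characters', the rest are 'words'."""
    characters = [e for e in entries if len(e) == 1]
    words = [e for e in entries if len(e) > 1]
    return characters, words


def chars_missing_from_char_list(characters, words):
    """Characters that occur inside multi-character words but not in the character list."""
    known = set(characters)
    return sorted({ch for w in words for ch in w if ch not in known})


# Illustrative sample only, not taken from an actual HSK list.
entries = ["马", "用", "最初", "现在", "在"]
characters, words = split_by_length(entries)
missing = chars_missing_from_char_list(characters, words)
```

Here 马 and 用 end up only in `characters`, even though they are words in their own right, and 初 shows up in `missing` because it only ever appears inside 最初.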

I was trying to work around this at one point by having separate indexes for characters and words, but characteristically I forgot about the project and started something else (I forget what now). I think I've got Excel files for both 字 and 词 for the first level somewhere.

Roddy

