Chinese-Forums

HSK Online Searchable Vocabulary Database now in Beta Test


roddy


Roddy:

Someone mentioned flashcards earlier. Here's another option.

I run http://www.yellowbridge.com, which includes a Chinese language section. I recently started offering free online flashcards based on the "Integrated Chinese" textbook series as well as a couple of lists of frequently used characters. It occurred to me that it would be relatively easy to create a set of flashcards based on the HSK word list you've already created. However, having done a fair amount of Chinese data entry myself, I know that it was a lot of effort to enter the list so I don't want to "borrow" your list without your permission.

Please take a look at http://www.yellowbridge.com/language/flashcards.html to see my existing flashcards. Many students have found flashcards to be one of the few effective tools for memorizing Chinese words and characters.

If you do decide that it is OK for me to use your list, I could extract the necessary information directly from your query tool so you wouldn't have to provide me with anything special. Proper attribution would be provided, of course.

I indicated that it would be relatively easy but it is not without effort. A quick trial run using the HSK level 1 list indicated that about 40 entries (out of 1000) lacked corresponding entries in CEDICT. I would expect that the percentage of missing entries would be much higher in the higher levels. I would still need to manually track down the definitions of the missing words.
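The trial run described above amounts to a set-membership check. Here is a minimal sketch of it, assuming the HSK words and the CEDICT headwords have already been loaded into Python collections. The names `missing_entries`, `missing_rate`, `hsk_words` and `cedict_headwords` are made up for illustration, and the sample data is a toy stand-in, not the real lists.

```python
def missing_entries(hsk_words, cedict_headwords):
    """Return the HSK words that have no headword in the dictionary, in list order."""
    headwords = set(cedict_headwords)
    return [w for w in hsk_words if w not in headwords]


def missing_rate(hsk_words, cedict_headwords):
    """Fraction of HSK words without a dictionary entry (0.0 for an empty list)."""
    if not hsk_words:
        return 0.0
    return len(missing_entries(hsk_words, cedict_headwords)) / len(hsk_words)


# Toy data for illustration only; not taken from the real HSK list or CEDICT.
hsk_words = ["帮助", "办公室", "各种", "打算"]
cedict_headwords = {"帮助", "办公室", "打算"}

print(missing_entries(hsk_words, cedict_headwords))
print(missing_rate(hsk_words, cedict_headwords))
```

With 40 missing entries out of 1000, `missing_rate` would come out at 0.04.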

Jaime


Roddy:

Thanks for the extremely useful resource.

The Xian Dai Han Yu Cidian 2003 now seems to use pinyin "yi1" exclusively for the character "一".

In fact, page 1483 of that august publication, under the headword yi2 一, refers the user to yi1.

But 一定 in HSK is under pinyin "yi2 ding4".

Xian Dai... uses "yi1 ding4".

ABC also uses yi1 ding4 but mentions that yi2 ding2 is pre ABC.

The Xian Dai.. reasoning behind this is that the character 一 is only listed under the single pinyin yi1.

It then follows that all compound words using 一, as the first character, should be under the yi1 pinyin headword.

If this standard were not adhered to, one would have no end of trouble finding words, as one would be unsure of the headword pinyin.

Not sure if you intend to use Xian Dai ... as the standard, but if you do then all the yi2 pinyin may need to be changed to yi1.
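If the list were ever moved to the yi1 convention, the change could be scripted rather than done by hand. A rough sketch, assuming numbered pinyin as used in this thread; `normalize_yi` is a hypothetical helper, and it only touches a leading 一, not one in the middle of a word.

```python
import re


def normalize_yi(word, pinyin):
    """Rewrite a leading yi2/yi4 as yi1 when the word starts with 一."""
    if word.startswith("一"):
        return re.sub(r"^yi[24]", "yi1", pinyin)
    return pinyin


print(normalize_yi("一定", "yi2ding4"))
print(normalize_yi("一直", "yi4zhi2"))
print(normalize_yi("以后", "yi3hou4"))  # does not start with 一, left alone
```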

Thanks again.

Myles


Roddy G'day...

The database is now delivering the search outcome twice.

For example with input ....

zuo4feng1 HSK level3 list

the outcome is

作风 zuo4feng1 (3)

作风 zuo4feng1 (3)

Earlier on the outcome was only a single line

作风 zuo4feng1 (3).

Otherwise it is all working fine.

Regards


The entire database is now available in comma-separated values (CSV) format at

http://www.chinese-forums.com/vocabulary/HSKcsv.csv (220KB)

If anyone wants to use this for their own purposes, go ahead - I'd appreciate it if you link back to the forums and let me know what you are using it for though.
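For anyone loading the file into their own tools, here is a minimal sketch of reading it with Python's `csv` module. The four-column layout (word, pinyin with tone marks, pinyin with tone numbers, level) is an assumption based on sample lines quoted in this thread, such as `帮助,bāngzhù,bang1zhu4,1`; `Entry` and `read_entries` are names made up for the example.

```python
import csv
import io
from collections import namedtuple

# Assumed column layout; adjust if the real file differs.
Entry = namedtuple("Entry", "word pinyin_marks pinyin_numbers level")


def read_entries(text):
    """Parse CSV text into Entry records, skipping rows that do not have exactly four fields."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 4:
            continue
        word, marks, numbers, level = (field.strip() for field in row)
        entries.append(Entry(word, marks, numbers, int(level)))
    return entries


sample = "帮助,bāngzhù,bang1zhu4,1\n一定,yídìng,yi2ding4,1\n"
for entry in read_entries(sample):
    print(entry.word, entry.pinyin_numbers, entry.level)
```

To use it on the downloaded file, read the file's contents into a string first; the encoding to pass to `open` is not stated here, so check it before relying on the result.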

The doubled results problem has been solved - a mistake I made last time I 'fixed' something.

As for pinyin for 一, I just used what was on the HSK vocab list I was using - I'm not going to cross-reference various standards.

Roddy


  • 2 weeks later...

I have gone ahead and created online flashcards based on the HSK word lists (courtesy of Roddy, thanks). At present only Levels 1 and 2 are ready. You can access the flashcards at http://www.yellowbridge.com/language/flashcards.html. Feedback and comments are appreciated, of course.

Question for Roddy. The csv file still contains a number of duplicates (not a lot). The ones with slightly different pronunciation, I understand. However, there are a few which appear to be true duplicates. I was wondering whether this was the result of the same bug you mentioned earlier or whether there are other differences in the original source that wouldn't show in the file (such as different meanings or different traditional characters).

Jaime


That's a result of me blindly copying the HSK word list - for example, in the list I have, 把 is in twice, as 介词 and 量词 - so in that case it's because there are different meanings. However, in other cases there might be only one entry which is classed as 名, 动 and I've only put it in once.

Roddy


  • 2 weeks later...

Roddy G'Day...

Not sure what is happening but I did a search for 办* on the web page.

It did not return 办公室 which is level 1 in your csv list.

各* does not bring up 各种 from level 1 either???

帮* does not bring up 帮助,bāngzhù,bang1zhu4,1 either

etc. etc.
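For anyone who wants to check what such a wildcard query should return from the csv itself, here is a small offline sketch using Python's `fnmatch`. This is not how the site's search is implemented (that isn't shown in this thread); `wildcard_search` and the sample word list are illustrative only.

```python
from fnmatch import fnmatchcase


def wildcard_search(pattern, words):
    """Return the words matching a shell-style pattern, where * matches any run of characters."""
    return [w for w in words if fnmatchcase(w, pattern)]


# Toy word list; in practice this would be the first column of the csv.
words = ["办公室", "办法", "各种", "帮助", "帮忙"]

print(wildcard_search("办*", words))
print(wildcard_search("各*", words))
print(wildcard_search("帮*", words))
```

If a word is in the csv but missing from the site's results for the same pattern, that points at the search rather than the data.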

Also, while ban4gong1shi4 seems to be the popular reading these days, I think the pinyin in the csv list has 公 as a third tone.

Another pinyin worth a look is 打算,da3suan4,1

SDHYCD has that listed as da3suan5

Similarly, the pinyin for these entries differs from SDHYCD:

关系,guan1xi,1.................... guan4

后边, hou4bian1,1.............. bian5

里边, li3bian1,1 ............... bian5

哪里, na3li3,1........................li5

痛快, tong4kuai,1............... kuai4

I have mentioned this before but the entries from

一定,yídìng,yi2ding4,1

down to

一直,yìzhí,yi4zhi2,1

have different pinyin from SDHYCD

The following csv line could also do with a tweak....

歌,gēge,ge1ge,1

mph


  • 2 weeks later...
  • 2 months later...
  • 3 weeks later...

Ended up looking at the pinyin for Roddy's HSK list tonight. Am curious about the following characters whose pinyin representations seem uncommon to me. I'm probably wrong, but thought I'd flag them just in case.

Most are duoyinci, in which case it's an issue of which is more common. Anyone care to comment?

yan4 咽 (yan1)

zhao2 着 (zhe3) (zhao1 for 着急?)

ben4 奔 (ben1)

he1 呵 (he5)

jia4 假 (jia3)

ning3 拧 (ning2)

ying4 应 (ying1)

cheng4 秤 (chen4)

ding4 钉 (ding1)

fan4 泛 (fa2)

feng4 缝 (feng2)

heng4 横 (heng2)

huo1 豁 (huo4)

juan4 圈 (quan1)

nan4 难 (nan2)

tiao3 挑 (tiao1)

Also noticed this entry with odd formatting for the pinyin column:

4-Jun 俊 (jun4)
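An entry like `4-Jun` looks like what a spreadsheet can produce when it reads `jun4` as a date, though that is a guess. Entries of this kind can be flagged automatically by checking the pinyin column against a simple pattern. A sketch, assuming numbered pinyin with an optional tone digit after each syllable; `suspicious_pinyin` and the sample rows are illustrative.

```python
import re

# Letters, then any number of (tone digit + letters) groups, then an optional final tone digit.
NUMBERED_PINYIN = re.compile(r"[a-zü]+(?:[1-5][a-zü]+)*[1-5]?")


def suspicious_pinyin(rows):
    """Return (word, pinyin) pairs whose pinyin is not plain numbered pinyin."""
    return [(w, p) for w, p in rows if not NUMBERED_PINYIN.fullmatch(p.lower())]


rows = [("俊", "4-Jun"), ("办公室", "ban4gong1shi4"), ("关系", "guan1xi")]
print(suspicious_pinyin(rows))
```

This only checks the shape of the field, not whether the reading is correct, so it would not catch the tone questions raised above.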


  • 2 weeks later...

Ok, after a brief hiatus of 6 months or so, I finally got back to this . . .

Today's changes:

1) You now have English for many of the entries. For this you should be thanking our very own Trevelyan, who ran the list through the magic machine which powers his Adsotrans page (check out the very useful webpage annotation function). Should have done this ages ago, wasn't difficult at all (especially as someone else did the difficult bit for me).

Caveats are that the English entries are neither complete nor perfect. Some are missing, and in other cases entries which should have different English translations actually have both (i.e. the 2 entries for 应 (1st tone and 4th tone) should be 'should' and 'answer, respond' respectively. Actually, they are both 'should, answer, respond').

If you want to suggest changes / improvements to the English, please do so via Adsotrans and when that's updated I'll import the new entries - I'm not going to edit the English directly (I think)

2) A number of errors and typos corrected, including some of the above.

3) In addition to list and card output, you now also have the option for a CSV (comma-separated values) file. This can be opened in Excel, many database programs and also some flashcard programs such as Supermemo. It is not delivered as a .csv file; you'll get an html page of text you can copy and paste.

4) I put back the 2000 entries I forgot to import last time I updated the database :oops:
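The CSV output mentioned in item 3 can be approximated offline as well. A minimal sketch using Python's `csv` module, not the site's actual code; `to_csv_text` and the column order are assumptions for the example. Letting the `csv` writer do the work matters once English is included, because definitions like 'should, answer, respond' contain commas and need quoting.

```python
import csv
import io


def to_csv_text(entries):
    """Serialise rows (tuples of fields) as CSV text, one entry per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(entries)
    return buffer.getvalue()


entries = [("作风", "zuo4feng1", 3), ("帮助", "bang1zhu4", 1)]
print(to_csv_text(entries))

# A field containing commas is quoted automatically.
print(to_csv_text([("应", "ying1", "should, answer, respond")]))
```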

Next plans are - word class information, and searching on the English entries. Suggestions on how this can be made more useful are welcome, and I'll see what I can do (over the course of the next decade :wink: )

Roddy

Tiny Edit: 连续 is misplaced, and depending on how you order things may not appear where you expect it to in a list. However it is there.


  • 3 weeks later...

First, it is a wonderful resource you have! :clap

Since I will be taking my test soon, I was using the list and noticed a few odd things. To make sure it was the list that was odd and not just me and my stupid brain, I ran a test against it. I took your list, converted it into a database and ran this query against it:

-- list every row whose (chinese, pinyin, translation1) combination occurs more than once
select hsk_out.hsk_level, hsk_out.pk, hsk_out.chinese, hsk_out.pinyin, hsk_out.translation1
from hsk_out inner join
    (select chinese, pinyin, translation1
     from hsk_out
     group by chinese, pinyin, translation1
     having count(chinese) > 1
    ) as duplicates
on hsk_out.chinese = duplicates.chinese
and hsk_out.pinyin = duplicates.pinyin
and hsk_out.translation1 = duplicates.translation1

The result is 290 lines, i.e. approximately 100+ duplicates. I can fix this for you in no time if you want my help. Another strange thing, or it might just be me not knowing better, is that characters like one (yi1) appear in levels 1, 2 and 4. Can that be so?
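For anyone who wants to try the duplicate query without setting up a database server, here is a sketch that runs an equivalent query against an in-memory SQLite table from Python. The table and column names follow the query above; the column types, the `find_duplicates` helper, the added `ORDER BY`, and the rows (including the English glosses) are invented for the example.

```python
import sqlite3

DUPLICATE_QUERY = """
SELECT h.hsk_level, h.pk, h.chinese, h.pinyin, h.translation1
FROM hsk_out AS h
INNER JOIN (
    SELECT chinese, pinyin, translation1
    FROM hsk_out
    GROUP BY chinese, pinyin, translation1
    HAVING COUNT(*) > 1
) AS duplicates
  ON h.chinese = duplicates.chinese
 AND h.pinyin = duplicates.pinyin
 AND h.translation1 = duplicates.translation1
ORDER BY h.chinese, h.pk
"""


def find_duplicates(rows):
    """Load (hsk_level, chinese, pinyin, translation1) rows and return every row in a duplicated group."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hsk_out (pk INTEGER PRIMARY KEY, hsk_level INTEGER, "
        "chinese TEXT, pinyin TEXT, translation1 TEXT)"
    )
    conn.executemany(
        "INSERT INTO hsk_out (hsk_level, chinese, pinyin, translation1) VALUES (?, ?, ?, ?)",
        rows,
    )
    result = conn.execute(DUPLICATE_QUERY).fetchall()
    conn.close()
    return result


# Toy rows; the translations are placeholders, not the list's real glosses.
rows = [
    (1, "把", "ba3", "handle"),
    (1, "把", "ba3", "handle"),
    (3, "把", "ba3", "to hold"),
    (1, "帮助", "bang1zhu4", "help"),
]
print(find_duplicates(rows))
```

Note that the query returns every member of each duplicated group, which is why the number of lines is larger than the number of duplicated entries. Rows with a missing (NULL) translation never match in the join, so they would not be reported.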

Finally, if you want help with adding English definitions for the missing ones, I can help you with that too. They are still machine-analyzed, but I found all the missing translations. I can also contribute links to stroke order animations for many of the characters and direct links to zhongwen.com for genealogy.

I am still moving my old office to a new one after a fire, but I will try to find some time to go through this carefully and give you a list when I am done. If you let me know the format you prefer, I will create such a file for you.

Finally, a big, big thanks for this list. It has helped me a lot :D


Tonight I ran a few more tests against the list and added traditional characters, many more translations and variations of translations, and tone marks to the pinyin. The list is now stripped of duplicates. I also ran it against the Adso database to update it with word classes, and against Unihan to get the correct pinyin. Let me know if you want the list.

Now I am looking at adding measure words and other stuff.


Mandarinboy -

Could you give me some examples of 'duplicates'? As far as I am aware there are none - there are however cases where the same character is in two or more levels with different meanings / pronunciation - but I don't consider this a duplicate. Also, the English is often duplicated, but as I said regarding the English:

"Some are missing, and in other cases entries which should have different English translations actually have both"

I also already have tone marks with pinyin - when you run the search you can choose if you want the marks or the numbers. Word class information from Adso I also already have, I just haven't integrated it into the search function.
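For anyone working from a file that only has tone numbers, the conversion to tone marks is mechanical. Below is a sketch of the usual placement rule (mark a or e if present, the o in ou, otherwise the last vowel). It is not the code behind the search page, and `numbers_to_marks` and `mark_syllable` are names made up here. One known limitation: a neutral-tone syllable written without a digit is merged with the syllable that follows it, so it is only handled properly at the end of a word.

```python
import re

TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}


def mark_syllable(letters, tone):
    """Place the tone mark on one syllable; tone 5 (neutral) leaves it unmarked."""
    if tone not in (1, 2, 3, 4):
        return letters
    if "a" in letters:
        target = letters.index("a")
    elif "e" in letters:
        target = letters.index("e")
    elif "ou" in letters:
        target = letters.index("o")
    else:
        vowels = [i for i, ch in enumerate(letters) if ch in TONE_MARKS]
        if not vowels:
            return letters
        target = vowels[-1]
    marked = TONE_MARKS[letters[target]][tone - 1]
    return letters[:target] + marked + letters[target + 1:]


def numbers_to_marks(pinyin):
    """Convert numbered pinyin such as 'bang1zhu4' to tone-marked pinyin."""
    parts = re.findall(r"([a-zü]+)([1-5]?)", pinyin.lower())
    return "".join(mark_syllable(s, int(t) if t else 5) for s, t in parts)


for numbered in ["bang1zhu4", "yi2ding4", "ge1ge", "hou4bian1"]:
    print(numbered, numbers_to_marks(numbered))
```

Comparing the converted value with the tone-mark column is one way to spot rows where the two pinyin columns disagree.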

Any improvements to the English, I would suggest you send them to ADSO, and then I'll import them from there.

To be honest I'm not sure how much time I'm going to have to work on this over the next few months - if you want to do stuff for your own benefit that's great, but I can't guarantee I'll be able to use any of it.

Many thanks for your interest

Roddy


E.g

把 ( two entries in level 1 and one more in level 3 )

白 ( one entry in level 1 and one in level 2 )

当 ( two in level 1 and one in level 2 and one in level 3 )

点 ( 3 times in level 1 )

and so on. There are around 150 duplicates like those, if i am right.

As far as I can see, they have the same pinyin and the same meanings. These are the results I get from your page: http://www.chinese-forums.com/vocabulary/

If you need some help, just let me know and I can clean them up for you and add additional functionality. The lists are extremely helpful to me, and I am very, very happy that you have taken the time to put up this wonderful tool. :clap It has been so helpful for me in the past weeks.


I see what you are referring to - however, if I check my HSK word list, I find

把 - level one has two entries, once as preposition (把啤酒给我) and once as measure word (一把钥匙) and then in level 3 as a verb.

With 点,it is in level 1 three times - as a measure word, a noun and a verb.

I don't have time to check the others, but I'm pretty sure there'll be a similar explanation.

What's happening is that as I don't have part of speech info, and the English translations are not tailored to the HSK, but come out of ADSO, these might look like duplicates - but I'm confident that for each apparent duplicate there is something that will distinguish them.

Thinking about it, I'm not sure of the logic behind the HSK wordlists. For example, as I mention 点 has separate entries for its noun and verb forms. However, 病 has only one and is listed as (动, 名).

Regardless of the reason, it makes this easier for me if I just follow the HSK lists - that way, if there are 8822 entries in the complete lists, I know there should be 8822 entries in my database.
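To illustrate the point about apparent duplicates, here is a sketch that assumes part-of-speech information is available alongside each entry (it is not in the current csv). The tuple layout and both function names are made up for the example; the parts of speech follow the 把 and 点 cases described above.

```python
from collections import Counter


def apparent_duplicates(entries):
    """(word, pinyin) keys that occur more than once when part of speech is ignored."""
    counts = Counter((word, pinyin) for word, pinyin, pos, level in entries)
    return {key: n for key, n in counts.items() if n > 1}


def true_duplicates(entries):
    """Full records (word, pinyin, part of speech, level) that occur more than once."""
    counts = Counter(entries)
    return {entry: n for entry, n in counts.items() if n > 1}


entries = [
    ("把", "ba3", "介词", 1),
    ("把", "ba3", "量词", 1),
    ("把", "ba3", "动词", 3),
    ("点", "dian3", "量词", 1),
    ("点", "dian3", "名词", 1),
    ("点", "dian3", "动词", 1),
]

print(apparent_duplicates(entries))
print(true_duplicates(entries))
```

The same data also supports the count check: if the official lists have 8822 entries, `len(entries)` for the full database should be 8822, and removing apparent duplicates would break that.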

I'm really glad to hear you are finding the tool useful. I'd be interested to know exactly how you are using it, particularly if you are making use of any of the more advanced 'fuzzy' search options.

Roddy


Roddy, I have read the forum, and I know you already know about ban4gong1shi4 instead of ban4gong3shi4. I have read through the first 400 words, and I found these also:

#86. Cheng2ji4, not Cheng2ji1.

#106: Ci2 not Ci2dai4

#245: Ge1 not Ge1ge

#279: Guo4qu not Guo4qu4

#317: Huan2 not Huan4

#318: Huan4 not Huan2

I hope this helps.

