Chinese-forums.com
Learn Chinese in China

ReubenBond

Resources for developers


imron

That's a good list of resources. For corpora, I'd also add those from the SIGHAN word segmentation bakeoff, which contains corpora from the following organizations:

  • CKIP, Academia Sinica, Taiwan
  • City University of Hong Kong, Hong Kong SAR
  • Peking University, China
  • Microsoft Research, China

ReubenBond

Thank you, Imron, that data looks great! Am I mistaken in believing that the corpus data there is hand-segmented, and therefore is a fairly reliable gold-standard?

imron

You are not mistaken. It has been hand-segmented (or at least hand-verified), and the data has been used in multiple Chinese segmentation competitions, so one would hope that 'many eyes' would have caught most, if not all, of the remaining mistakes.

 

Regarding 'gold standard', there are multiple arguments for what constitutes a 'word' in Chinese and different corpora have different standards/definitions.

 

I don't think there's ever going to be one authoritative 'gold standard' for the entire language, however I think it's fair to say that the segmented data you can download for each corpus is the 'gold standard' for the definition of 'word' as used by each corpus (the page above has links to a document for each corpus that defines the standard used for determining a 'word').
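To make "gold standard for each corpus" concrete, here is a minimal sketch of how a segmenter's output could be scored against one corpus's hand-segmented data using word-level precision, recall and F1. The function names and the example sentence are my own illustrations, not taken from the bakeoff's scoring tools.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def score_segmentation(gold_words, predicted_words):
    """Return (precision, recall, f1) of predicted words against gold words.

    A predicted word counts as correct only if both of its boundaries
    match a word in the gold segmentation.
    """
    if "".join(gold_words) != "".join(predicted_words):
        raise ValueError("both segmentations must cover the same text")
    gold = word_spans(gold_words)
    predicted = word_spans(predicted_words)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Illustrative sentence: the gold standard keeps 学习 as one word,
# while the hypothetical segmenter splits it in two.
gold = ["我", "喜欢", "学习", "中文"]
predicted = ["我", "喜欢", "学", "习", "中文"]
precision, recall, f1 = score_segmentation(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because each corpus defines 'word' differently, the same segmenter output can score quite differently depending on which corpus's gold data it is compared with.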

mikelove

I should add our CC-Canto project here - cantonese.org, CC-BY-SA-licensed Cantonese-to-English dictionary along with Cantonese readings for CC-CEDICT.

 

Old version of LDC (which we offer in Pleco) was made freely available (reluctantly) by LDC on account of its being derived from CEDICT. Not sure if it's still on their public website but you can get it from archive.org. Stock Adso does have a lot of Pinyin issues, we've re-generated Pinyin in our version of it from better sources. Both dictionaries were IIRC problematic licensing-wise to build into an app - non-commercial restrictions make us nervous - but we offer them as separate add-on downloads. StarDict dictionaries definitely suspect licensing-wise.

 

Re segmentation, as imron suggests, the lack of standardization among definitions of what constitutes a word is a big problem - bigger still for us because those standards also vary among dictionaries; some dictionaries even include entire phrases, and if one of those is an accurate match we probably want to use it rather than the individual words as it's more likely to offer an accurate meaning. Our "intelligent segmentation" feature does something a bit similar to the graph you describe, but along with frequency and a little bit of grammatical analysis it also takes into account how many dictionaries believe a particular string of characters is in fact a word (with some extra weighting based on which dictionaries we trust more for determining that) - we're working on adding some other factors too.
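For anyone who wants to experiment with the general idea, here is a rough sketch of dictionary-weighted segmentation. It is not how any shipping product implements it: the dictionaries, trust weights, frequencies, the unknown-character penalty and the scoring formula are all made-up assumptions for illustration. Each candidate word is scored by how many dictionaries list it (weighted by trust) plus its log frequency, and dynamic programming picks the best-scoring path through the text.

```python
import math

# Hypothetical toy data: each dictionary is (set of headwords, trust weight).
DICTIONARIES = {
    "dict_a": ({"研究", "研究生", "生命", "起源", "命"}, 1.0),
    "dict_b": ({"研究", "研究生", "生命", "起源"}, 0.5),
}
# Hypothetical relative word frequencies.
FREQUENCY = {"研究": 0.004, "研究生": 0.001, "生命": 0.003, "起源": 0.001, "命": 0.0005}
UNKNOWN_CHAR_SCORE = -10.0  # assumed penalty for a character no dictionary lists


def word_score(word):
    """Score a candidate word, or return None if it is not a usable candidate."""
    votes = sum(weight for words, weight in DICTIONARIES.values() if word in words)
    if votes == 0:
        # Unlisted single characters are allowed (heavily penalised) so that
        # every text can be segmented; unlisted longer strings are rejected.
        return UNKNOWN_CHAR_SCORE if len(word) == 1 else None
    return votes + math.log(FREQUENCY.get(word, 1e-6))


def segment(text, max_word_len=4):
    """Return the highest-scoring segmentation of text as a list of words."""
    best = [0.0] + [-math.inf] * len(text)  # best[i]: best score for text[:i]
    back = [0] * (len(text) + 1)            # back[i]: start of the last word
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            score = word_score(text[start:end])
            if score is not None and best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    words = []
    end = len(text)
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


print(segment("研究生命起源"))
```

With this toy data the path 研究 / 生命 / 起源 beats 研究生 / 命 / 起源, because 命 is listed by only one dictionary and is rarer. Grammatical analysis and phrase-level entries would need additional terms in the score.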

大块头

Wow! Bookmarked. Thank you for putting this together.

kangrepublic

Yeah, thanks for putting all this together. Nice to have it all in one place.

Weyland

Thank you for this. It's very helpful.

On another note: Has anyone here dabbled in NLP (Natural Language Processing)?

mungouk

Great thread, thanks for bumping it @Weyland

 

 

philwhite
14 hours ago, Weyland said:

On another note: Has anyone here dabbled in NLP (Natural Language Processing)?

 

I use a free online speech-to-text service to create Anki flashcards from audio sources.

 

I use the online service to generate JSON transcripts from the audio sources in batches of 100 MB at a time. It is fairly accurate for complete sentences. I then use a GNU bash script (sed/grep/gawk/ffmpeg) to generate the .tsv file and the .mp3 segments in one pass for each 100 MB of audio and its JSON transcript, so there is no need to run subs2srs manually.
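In case it helps anyone, here is a rough sketch of that pipeline's shape, written in Python rather than bash purely for illustration. The JSON layout (a list of objects with "start", "end" and "text" keys), the file names and the function names are assumptions; each speech-to-text service has its own transcript format, so the parsing would need adapting. The sketch only builds the ffmpeg command lines and does not run them.

```python
import csv
import io
import json


def build_cards(transcript_json, audio_file, prefix="clip"):
    """Return (tsv_rows, ffmpeg_commands) for a timestamped transcript.

    Assumes a hypothetical transcript format: a JSON list of
    {"start": seconds, "end": seconds, "text": sentence} objects.
    """
    segments = json.loads(transcript_json)
    rows, commands = [], []
    for index, seg in enumerate(segments, start=1):
        clip = f"{prefix}_{index:04d}.mp3"
        # Anki plays a media file referenced as [sound:filename].
        rows.append((seg["text"], f"[sound:{clip}]"))
        # Cut the sentence out of the source audio; -vn drops any video stream.
        commands.append([
            "ffmpeg", "-y", "-i", audio_file,
            "-ss", f"{seg['start']:.2f}", "-to", f"{seg['end']:.2f}",
            "-vn", clip,
        ])
    return rows, commands


def rows_to_tsv(rows):
    """Serialise card rows as tab-separated text for Anki import."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up two-sentence transcript and file name.
transcript = json.dumps([
    {"start": 0.0, "end": 2.5, "text": "你好"},
    {"start": 2.5, "end": 5.0, "text": "谢谢"},
])
rows, commands = build_cards(transcript, "lesson.mp3")
print(rows_to_tsv(rows), end="")
for command in commands:
    print(" ".join(command))
```

Each command list could be passed to subprocess.run to actually cut the clips once the timestamps look right.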

 

I've also used Apache Lucene and C# to build a WinForms search application on a domain-specific English-language text database which I gathered.

Weyland

@philwhite have you ever looked into spaCy(BERT) or ERNIE?

philwhite
On 7/30/2020 at 3:24 PM, Weyland said:

have you ever looked into spaCy(BERT) or ERNIE?

 

Thanks for that. spaCy looks interesting for POS tagging and NER. I'm not sure it would help with my speech-to-text projects except, perhaps, for flagging incorrect transcriptions. Ideally, the neural speech-to-text services would use knowledge from BERT to improve their neural nets and the accuracy of their transcriptions. After all, BERT is trained to fill in a gap in text with the most likely word.

 

Did you have a specific project or purpose in mind for NLP?

 

I'd read a little about BERT and ERNIE. BERT has been a real game-changer, it seems, though I don't know how much training has been done on Chinese text.

Weyland
5 hours ago, philwhite said:

Did you have a specific project or purpose in mind for NLP?


Grade the 100k most common words by difficulty, so as to give an indication of how difficult texts are for foreign-language students (based on the new HSK 3.0 levels).

Here is another, more abundant, list of resources.

imron
10 hours ago, philwhite said:

though I don't know how much training has been done on Chinese text.

A lot of the top AI researchers (even in the US) are Chinese, and one of the benefits of that is that alongside English, Mandarin is often a first-class language for a lot of research (not sure if this is relevant to BERT and ERNIE, but a lot of NLP research has good Mandarin support).

philwhite
8 hours ago, imron said:

A lot of the top AI researchers (even in the US) are Chinese, and one of the benefits of that is that alongside English, Mandarin is often a first-class language for a lot of research (not sure if this is relevant to BERT and ERNIE, but a lot of NLP research has good Mandarin support).

 

Yes, one of the biggest breakthroughs in deep learning in recent years was Kaiming He's paper on deep residual nets, written while he was at Microsoft in China. Baidu Research in China and the US has published a lot. Unfortunately, my Chinese is not good enough to read the API of Baidu's web service. In the UK, Prof Guo at Imperial is well known.

 

I was thinking specifically of BERT. I should have written "I don't know how much training of BERT has been done by Google on Chinese text".

philwhite
13 hours ago, Weyland said:

Grade the 100k most common words by difficulty, so as to give an indication of how difficult texts are for foreign-language students (based on the new HSK 3.0 levels).

Here is another, more abundant, list of resources.

 

Many thanks Weyland for the link to the list of resources. Unfortunately, my reading skills are nowhere near good enough for most of them.

 

Please forgive me, I'm slow on the uptake here. Are you talking about:

  1. Grading common words (and grammar constructs) for difficulty from scratch, or
  2. Grading texts for difficulty, given a grading of vocab and grammar constructs for difficulty?
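If it is the second, I imagine something along these lines: a minimal sketch, assuming the text is already segmented into words and that a word-to-level table exists. The levels below are made-up placeholders rather than real HSK 3.0 data, and the coverage threshold and the handling of ungraded words are assumptions too. Grammar constructs are ignored here.

```python
from collections import Counter

# Hypothetical word grades (1 = easiest); a real table would be loaded
# from a graded vocabulary list.
WORD_LEVELS = {"我": 1, "喜欢": 1, "学习": 1, "中文": 2, "语法": 4}
UNKNOWN_LEVEL = 10  # assumption: ungraded words rank above every graded level


def text_level(words, coverage=0.9):
    """Return the lowest level L such that words graded at or below L
    make up at least `coverage` of the segmented text."""
    if not words:
        raise ValueError("text must contain at least one word")
    counts = Counter(WORD_LEVELS.get(word, UNKNOWN_LEVEL) for word in words)
    covered = 0
    for level in sorted(counts):
        covered += counts[level]
        if covered / len(words) >= coverage:
            return level
    return max(counts)  # only reached if coverage is set above 1.0


words = ["我", "喜欢", "学习", "中文", "语法"]
print(text_level(words, coverage=0.75))
print(text_level(words, coverage=1.0))
```

Counting word tokens rather than distinct words means a rare hard word that appears once barely moves the level, which may or may not be what a learner wants.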
