Chinese-Forums

A Unified Word Bank for Connecting Chinese Learning Services?


lechuan


There are a lot of great Chinese learning services out there (Skritter, FluentU, Chairman's Bao, etc.), each with its own word bank that tracks which words you know, don't know, are learning, etc. These are used to keep track of your progress, help you memorize vocab, make suggestions, and so on.

 

What I think would be the ultimate next step is a service that provided a unified word bank: something with an agreed-upon API that sites/apps could integrate as a service in their products. Perhaps a non-profit, open-source database server, written in a fast back-end language that can handle many concurrent users (Go?), whose costs would be supported by the products that use it?
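To make the idea concrete, here is a minimal sketch of what such a shared word bank might look like as a data model. Everything here (the `WordEntry` fields, the `WordBank` class, the method names) is my own illustrative assumption, not an existing API; a real service would sit behind a network API rather than in memory.

```python
from dataclasses import dataclass

# Hypothetical record a shared word bank might expose. Field names are
# assumptions for illustration, not an agreed standard.
@dataclass
class WordEntry:
    headword: str               # e.g. "你好"
    pinyin: str                 # e.g. "ni3hao3"
    status: str = "unlearned"   # unlearned / learning / learned

class WordBank:
    """In-memory stand-in for the proposed shared service."""
    def __init__(self):
        self._entries = {}

    def upsert(self, entry: WordEntry):
        # Key on headword + pinyin, since characters alone are ambiguous (多音字).
        self._entries[(entry.headword, entry.pinyin)] = entry

    def status_of(self, headword, pinyin):
        e = self._entries.get((headword, pinyin))
        return e.status if e else "unknown"

bank = WordBank()
bank.upsert(WordEntry("你好", "ni3hao3", "learned"))
print(bank.status_of("你好", "ni3hao3"))  # learned
```

A site like FluentU could then ask the bank which words a learner already knows before suggesting content, while Skritter could push newly studied words back in.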

 

Any thoughts? An idea worth exploring? Technical or financial obstacles?


Hi. You can centralize this yourself:

 

I used Memrise and FluentU's SRS features for a time. Also "Learning With Texts", which is free and also tracks your vocab. I did try Skritter (twice!) but each time could not make heads or tails of it. And I've also used Anki and Pleco for Chinese vocabulary and individual characters. It's exhausting trying to keep track of all those flashcards in different places, so these days I only keep my Chinese vocabulary SRS in Pleco.

 

I no longer use the "learn" (aka SRS) functions of any individual app/site. Remember that SRS needs sustained work for months or years to yield results, so keeping your SRS in subscription services does not seem like a good long-term strategy to me. At best, this leads to fragmentation of your SRS work among different places. This is especially true if you stop paying for a service (Skritter, FluentU): even if the service still lets you review old vocabulary, you'll now need to add new words somewhere else (say, Anki or Pleco), which makes it difficult to keep track of duplicates. At worst, it leads to loss of data if the service ever goes belly up, suffers a hacker attack, or you lose your login.

 

For a long time I kept my Heisig keyword-to-character deck in Anki, but then I discovered Pleco's "alternating" subject selection, and now I alternate reading and writing practice within the same test session. It's a lot more interesting than doing just one or the other: I used to zone out when doing only "reading" practice, but now I'm fully alert for the whole session. If I'm at home, I use "self-graded", because this lets me use pencil and paper for writing; if I'm out and about, I choose "fill in the blanks", as this lets me use Pleco's handwriting recognition (Skritter-style).

 

IMHO, if you make sure to follow the "rules of formulating knowledge", you'll be fine with just Pleco (or Anki).  https://www.supermemo.com/en/articles/20rules

I did tweak Pleco's scoring to make sure I spend less time with flashcards and more time with actual Chinese content and people.

 

One of the big pros of Pleco's flashcard system is that you can have several different study lists (aka Categories), and if a word is in two or more lists, you won't really be "duplicating" it. Thus, you could keep an individual category for each of the services/sites you use, without fear of wasting time reviewing duplicates. I guess you could even copy/export your vocab lists from those sites/apps and import them into Pleco. And I often export my list for use with Chinese Word Extractor in preparation for tackling a new text. So it's a very open system, and you can easily import/export words.

 

If you go with Anki, I'd recommend keeping everything you can in a single deck, to prevent duplicates, and use tags to separate stuff.


I am not sure of technical or financial obstacles, apart from the fact that I don't think all the various apps and learning services would work together. It's not in their interest; they make their money by offering word lists specific to their apps.

 

Don't they all use Unicode these days for displaying characters?

 

However much it might seem that these providers are helping us learn Chinese because they want to help us, they are, after all, out to make a profit.

 

Keeping their word lists in different formats, so that they can't be shared, gives them something to charge for.

 

I understand where you are coming from; it does seem sensible to have one place to draw from. It would also ensure everyone is using the same set and avoid the confusion of different variants being used in different apps.

 

It will be interesting to hear what other people have to say about it; maybe a programmer or a Chinese learning product provider will give their point of view.

 


To be honest, the odds of everyone coming together on this seem low - I wouldn't even necessarily object to it from a business end (we aren't really in the business of charging people for wordlists anyway), but I wouldn't want our future feature improvements etc to be constrained by the need to interoperate with something we didn't control.

 

If we eventually do a web-based service (still quite likely - building a lot of sync code into 4.0 anyway), that would probably have a public vocabulary management API and probably also support free / easy exports to a neutral format you could bring into something other than Pleco, but it would probably also come with at least a modest subscription fee for the flashcard end, since we haven't yet come up with a way to do reliable flashcard sync cheaply enough at scale to offer it for free.


I'm not sure this would be a productive way to study Chinese. An important part about adding vocabulary is being selective in what you add. If you are using enough different services to merit a service like this, you're likely over-reaching your vocabulary additions.

 

Maybe consider differentiating between the sources you use for vocabulary, the sources you use for pleasure, and the sources you use for review. For example, The Chairman's Bao for review (since the content is simpler), one book for new words, and a TV show for pleasure. Do you understand everything being said on the TV show? Of course not, but you don't need to understand everything to be improving your Chinese. Of course, there is also a bottom line: if you understand too little, you should switch to simpler content.


3 hours ago, 艾墨本 said:

I'm not sure this would be a productive way to study Chinese. An important part about adding vocabulary is being selective in what you add. If you are using enough different services to merit a service like this, you're likely over-reaching your vocabulary additions.

I think this is a slight misunderstanding of what's being proposed.

 

It seems to me that Lechuan is suggesting an API for managing wordlists, that providers of Chinese learning software could use, and then your list of known words could follow you around from service to service without you needing to do anything special.


I think most of the criticism here is valid, but not a reason not to do something like this. It's perfectly possible to have a set of known morphemes and lexis that is not particularly tied to the features of any app or service accessing it. It's perfectly possible for different services to maintain their own pertinent information for their functionality, while keeping a minimum of information in this unified API to make all entries distinct and appropriately linked between apps.

 

Like I said, these criticisms are valid because you would need to strike the perfect balance of data to allow full interoperability, while not bogging down services with compulsory information they don't need.

 

I think something as simple as simplified form, traditional form, and part-of-speech information, plus a ternary unlearned/learning/learned status, would be sufficient for most services to share information without losing too much precision. But then you have to wonder how we would implement this sort of thing without choking it or making it prohibitively expensive to maintain.
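As a sketch of how small that shared record could be, here is the four-field entry as a plain dictionary, plus one way services might merge conflicting statuses. The field names and the conservative merge rule (keep the lower level of knowledge) are my own assumptions, not part of any proposal in this thread.

```python
# A sketch of the minimal four-field record proposed above
# (field names are illustrative assumptions, not an agreed schema).
entry = {
    "simplified": "冷面",
    "traditional": "冷麵",
    "pos": "noun",
    "status": "learning",   # ternary: unlearned / learning / learned
}

def status_rank(status):
    # Ordering the three statuses lets services merge conflicting reports.
    return ["unlearned", "learning", "learned"].index(status)

def merge_status(a, b):
    # Conservative merge: when two services disagree, keep the lower
    # (less optimistic) level of knowledge.
    return a if status_rank(a) <= status_rank(b) else b

print(merge_status("learned", "learning"))  # learning
```

Whether to merge conservatively or optimistically is exactly the kind of policy question each consuming service could decide for itself.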


Parts of speech for Chinese are notoriously tricky - not only is there a lack of clear agreement on what the valid parts of speech in Chinese are (ABC lists 32 of them) but they're used fluidly enough that often dictionaries will also disagree on things like whether a particular word is an adjective or a verb or both, even if it's appearing in the exact same sentence.

 

Simplified/traditional is another mess - quite a few words have multiple valid traditional versions (is it 台灣 or 臺灣?), so you need either a centralized official word list to identify words as distinct or not (same simplified is not in fact a reliable key - 冷面 e.g. can stay the same in traditional or turn into 冷麵 depending on whether we're discussing faces or noodles) or to simply allow for lots of duplicate variant words. We've spent months of programmer time on variant merging in Pleco and it's still far from perfect.
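The 冷面 example above can be shown in a few lines: keying entries on the simplified form alone collapses two distinct words. The sense labels here are my own shorthand for illustration.

```python
# Why the simplified form alone is not a reliable key: one simplified
# spelling can map to different traditional forms depending on the sense
# (example from the post above).
simp_to_trad = {
    ("冷面", "cold face / stern"): "冷面",
    ("冷面", "cold noodles"): "冷麵",
}

# Keying on simplified alone would collapse the two senses:
collisions = {}
for (simp, sense), trad in simp_to_trad.items():
    collisions.setdefault(simp, set()).add(trad)

print(collisions["冷面"])  # two distinct traditional forms
```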

 

You don't mention Pinyin, but nearly all dictionaries separate entries by pinyin + characters; without it you've got to deal with lots of confusing 多音字, with it you've got to deal with Taiwan pronunciation variants, erhua, whether or not a particular source incorporates tone sandhi into its Pinyin readings, final tones that are usually pronounced as neutral, etc.

 

Also, quite a few sources introduce different senses of a word at different times, or exclude some senses that others include; you might learn 市 'market' versus 市 'town' in different chapters of a textbook, for example. So just because one source considers you to have 'learned' 市 doesn't mean that you've learned all of the meanings of 市 that another one wants to teach you.

 

And of course this unlearned / learning / learned status is fraught with all sorts of complications too (e.g., does a card go back to 'unlearned' when you've forgotten it?).

 

My basic point here is that even for something as simple as the four fields you're proposing to track, we end up with a lot of tricky design issues that a bunch of different apps / websites are very unlikely to agree on.


Some ideas based off my personal XML word bank:

  • Treat a word entry as "a headword and pronunciation pair".
  • Use a database-wide flag to indicate whether word bank headwords are simplified or traditional, and which pronunciation system is used (Mandarin, Yueyu, etc.). Provide optional fields in each word entry to store the headword(s) of the other writing system and other pronunciation(s).
  • Permit duplicate word entries, but warn when they are created.
  • Provide a folder/tree system for organising word entries into groups. (e.g. difang, mingzi, chengyu)
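The bullets above might translate into an entry like the following. I'm building it with Python's `xml.etree.ElementTree` so it's runnable; the tag and attribute names are my own guesses at what the poster's XML word bank looks like, not their actual schema.

```python
import xml.etree.ElementTree as ET

# Sketch of one word entry following the ideas above: database-wide flags
# on the root, a group (folder) node, and optional alt-script fields.
bank = ET.Element("wordbank", script="simplified", pronunciation="mandarin")
group = ET.SubElement(bank, "group", name="difang")
entry = ET.SubElement(group, "entry")

# The entry is a headword-and-pronunciation pair:
ET.SubElement(entry, "headword").text = "台湾"
ET.SubElement(entry, "pinyin").text = "Tai2wan1"

# Optional field for the other writing system:
ET.SubElement(entry, "alt-headword", script="traditional").text = "臺灣"

print(ET.tostring(bank, encoding="unicode"))
```

Duplicate detection ("permit duplicates, but warn") would then be a check on the headword + pronunciation pair at insertion time.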

Quote

Parts of speech for Chinese are notoriously tricky - not only is there a lack of clear agreement on what the valid parts of speech in Chinese are (ABC lists 32 of them) but they're used fluidly enough that often dictionaries will also disagree on things like whether a particular word is an adjective or a verb or both, even if it's appearing in the exact same sentence.

 

This is not really an issue. We're trying to construct a resource in which items can be uniquely identified, not a linguistically accurate lexicon. Other services using this database for morpheme matching could take exception to certain classifications and overrule them, or simply not reveal those classifications to the user. There is no problem using even a PoS the creator would consider inaccurate, as long as it functionally allows us to distinguish any potential conflicts. We could even translate the classifications to purely arbitrary numbers once a classification scheme was nailed down, if we really did not want to have a final say on linguistic matters. Or, if we did, we could still do this and just make clear what each number meant, letting services choose how to represent them. For instance, if we were doing Japanese (I'm choosing Japanese because I know more about its formal linguistics), we could have separate categories for adjectival nouns and adjectival verbs, but not force services to recognise the difference. Services would still be free to recognise them as nouns, verbs, adjectival nouns, adjectival verbs, so-called "na-" and "i-" "adjectives", or just "adjectives".

 

The issue is merely providing enough information that the morphemes can be reliably labeled.

 

Will this work perfectly? No. Is it possible that some services will not be able to communicate every little variation and sub-category of PoS they consider important? Yes.

 

When I'm studying Japanese, if MorphMan were to treat the na-adjectival form of a word and its normal noun form as identical, I might object, claiming that they are different parts of speech (I wouldn't, but some would). But I don't think this linguistic matter would greatly decrease my ability to parse a sentence that used the na-adjectival form of what I previously learnt as just a noun.

 

Perfect is the enemy of good here. You just tailor your service so these issues aren't show-stopping. Having something that is heavily, if not perfectly, unified is still good.

 

Quote

Simplified/traditional is another mess - quite a few words have multiple valid traditional versions (is it 台灣 or 臺灣?), so you need either a centralized official word list to identify words as distinct or not (same simplified is not in fact a reliable key - 冷面 e.g. can stay the same in traditional or turn into 冷麵 depending on whether we're discussing faces or noodles) or to simply allow for lots of duplicate variant words. We've spent months of programmer time on variant merging in Pleco and it's still far from perfect.

 

One could simply treat the two as variants of a morpheme. I can see a nested hierarchy here, and services could decide how they respond to those hierarchies.
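One way to read "nested hierarchy" is a canonical form with variants beneath it, which each service matches against however it likes. This structure is purely illustrative; the internal key and field names are my own.

```python
# One possible variant hierarchy: a canonical morpheme with variant
# forms beneath it (structure and names are illustrative assumptions).
morpheme = {
    "id": "taiwan",            # arbitrary internal key
    "canonical": "臺灣",
    "variants": ["台灣", "台湾"],
}

def matches(form, m):
    # A service deciding how to "respond to the hierarchy" can treat any
    # variant as a hit for the same underlying morpheme.
    return form == m["canonical"] or form in m["variants"]

print(matches("台灣", morpheme))  # True
```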

 

Of course, this wouldn't work perfectly, but I think again, "perfect is the enemy of good" here. Something that can't perfectly map every relationship, but which does a good enough job is better than no united database at all.

 

Quote

You don't mention Pinyin, but nearly all dictionaries separate entries by pinyin + characters; without it you've got to deal with lots of confusing 多音字, with it you've got to deal with Taiwan pronunciation variants, erhua, whether or not a particular source incorporates tone sandhi into its Pinyin readings, final tones that are usually pronounced as neutral, etc.

 

Taiwan pronunciation variants aren't show-stopping, I don't think. Remember, the way I'm framing it, this is a database for function, not a perfect collection of all knowledge. While it's conceivable that a service might want to structure things so that Taiwanese pronunciation is treated differently and learned as a separate sub-variant, that's up to them, and they can implement that; but it's unlikely to be necessary for most services, and unlikely to be something that needs translating across services, even if EVERY service decided to treat them that way. Minimal disruption would occur if each service kept its own tracking of whether the learner had seen the Taiwanese pronunciation variant yet.

 

And I think this basically applies to your other points here.

 

Again, I'm going for a "good enough" approach here.

 

Quote

Also, quite a few sources introduce different senses of a word at different times, or exclude some senses that others include; you might learn 市 'market' versus 市 'town' in different chapters of a textbook, for example. So  just because one source considers you to have 'learned' 市 doesn't mean that you've learned all of the meanings of 市 that another one wants to teach you.

 

I don't really see this as show-stopping. Again, simply ignoring this leads to slight inefficiencies that would be far outweighed by the efficiencies gained by making this kind of thing more doable.

 

Quote

And of course this unlearned / learning / learned status is fraught with all sorts of complications too (e.g., does a card go back to 'unlearned' when you've forgotten it?).

 

I'll grant you that, but this could be resolved by introducing a layer of abstraction between each service's conception of learned status and the shared standard. This could even conceivably be left up to the user of the database.

 

Quote

My basic point here is that even for something as simple as the four fields you're proposing to track, we end up with a lot of tricky design issues that a bunch of different apps / websites are very unlikely to agree on.

 

I don't think we really need them to agree on everything; we just need them to agree enough that a good-enough solution emerges. Over time, tweaks to what information is considered pertinent, or finer-grained distinctions, would probably need to be made, and forward planning could mitigate the deleterious effects of these.

 

As much as I would love one unified database of detailed information that could be accessed via an API to generate the best Anki cards ever, or act as the greatest database of morpheme relations, etymology, and linguistic understanding yet conceived, that's really not what I'm talking about here.

 

I respect that you probably have expertise and experience in these kinds of matters that I'm missing. But I also think your experience working with and on a dictionary app makes you think in terms of information accuracy and completeness, and I'm not really talking about that kind of information interchange. I'm talking about a "good enough" workable solution for tracking learning fairly well across a diverse array of apps and services that are inevitably going to differ on what they consider important.

 

All one needs is a schema that apps can map their own data to, not something that matches up 1-to-1 with, and contains all the detail of, every single app's internal framework (or one universally accepted framework). You just need something close enough to the useful level of distinction that any inefficiencies (like not accounting for Taiwan-accent pronunciation differences) are minimised enough for the benefits to outweigh them.

 

Anyway, let me know if you think I have the wrong idea. Like I said, I'm sure you're more aware of the pitfalls than I am, but I also think you're falling prey to perfectionism, and to a disconnect between what I'm aspiring to propose and what you're knocking down.


Sorry for the slow reply.

 

I'll certainly admit that given my background I'm likely to be fussier about these things than most. However, in this case it's important to note that this wouldn't just be happening once; there's the potential to introduce errors / duplicates / data loss every time you sync with another app or website. So the more you use this service, the more likely you are to get to the point where the level of inaccuracy is no longer acceptable.

 

I'd also be wary as a developer of potentially getting blamed for imperfections caused by a) different apps or b) the inherent limitations of a sync service like this; if somebody uses this system to bring some words from another app into Pleco and then finds a lot of them are missing / duplicated / don't match up correctly, since that transaction took place within my app I'm the most likely one to get an angry support email.

 

I can see this working a bit better if it was one-way, i.e., you dump words into it, search / review / download your word lists from the service itself, and perhaps also offer the option of checking word statuses in it via an API, but you don't expect to actually pull data out of it and into another app, or to use it to sync data between apps. However, in that scenario you're also imposing a lot more workload on this non-profit open-source service.

 

If I were to try to build something like this I'd probably focus on a data interchange format without introducing an intermediary service at all; simply come up with a standard way for an app / website to accurately dump as much data as it possibly can about your known words to a file that other apps can read. So a lot of items like tone sandhi / variant status / etc can be optionally dumped to this file - could even support dictionary entry identifiers so that if, say, Pleco and Wenlin want to share links to specific ABC entries we can do that - but other apps aren't necessarily bound to pay attention to those fields if they don't care about them; at the same time, two apps that want to work together would now have a neutral format to share data in and could potentially even add custom handling for each other's peculiarities. Basically, instead of trying to impose a centralized service on everything, you simply facilitate sharing between individual apps and leave it to developers to sort out the quirks.
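A file-based interchange format of that shape, with a small required core and optional extras that other apps are free to ignore, might look something like this. The format name, version field, and `dict_entry_id` convention are all hypothetical; the point is only the required-plus-optional split.

```python
import json

# Sketch of a neutral export format along the lines described above:
# required core fields plus optional extras (traditional form, dictionary
# entry links) that a reading app may simply skip. All names are
# illustrative assumptions, not a real Pleco/Wenlin format.
export = {
    "format": "cn-vocab-interchange",   # hypothetical format identifier
    "version": 1,
    "words": [
        {
            "simplified": "台湾",
            "traditional": "臺灣",         # optional
            "pinyin": "Tai2wan1",
            "status": "learned",
            "dict_entry_id": "ABC:12345",  # optional app-specific link
        }
    ],
}

blob = json.dumps(export, ensure_ascii=False)
loaded = json.loads(blob)

# A reader that doesn't care about the optional fields keeps only the core:
core = [(w["simplified"], w["pinyin"], w["status"]) for w in loaded["words"]]
print(core)
```

Two apps that do care about each other's extras (say, shared dictionary entry identifiers) could then add custom handling on top, exactly as described above, without the format forcing it on anyone else.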


A central website with some sort of simple on-demand SRS word-sync API (containing, e.g., "user X learned the pinyin for word A to 75%, the tones to 50%, the translation to <language X> to 100%") would be nice. Even without a uniform interpretation of these numbers (be they Anki's "difficulty" or Skritter's %), they can be roughly recalculated.
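As a rough illustration of such a recalculation, here is one arbitrary way to map an SRS review interval (days until next review, which Anki-style schedulers track) onto a 0-100 "knowledge %". The logarithmic curve and the 90-day maturity threshold are pure assumptions; the point is only that different services' numbers can be coarsely translated into a shared scale.

```python
import math

def interval_to_percent(interval_days, mature_at=90):
    """Map a review interval in days to a rough 0-100 knowledge score.

    Arbitrary illustrative mapping: 0 days -> 0%, and anything at or
    beyond `mature_at` days counts as 100%.
    """
    if interval_days <= 0:
        return 0
    pct = 100 * math.log1p(interval_days) / math.log1p(mature_at)
    return min(100, round(pct))

for d in (0, 1, 10, 90, 365):
    print(d, interval_to_percent(d))
```

The receiving service would apply the inverse of its own mapping, accepting that round-tripping loses some precision, in the "good enough" spirit discussed above.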

Personally, pulling words out of Skritter into CTA or Anki is kind of annoying, probably because Skritter is so slow at building lists with a hundred-plus words. Then again, someone has to set it up and run it, and a few years later everyone will be using it ;)


  • 2 weeks later...

Thanks everyone for your input! My conclusion after considering all the feedback is that having a central system to sync and store would be a bad idea: a lot of data flying around, and just one bad or poorly coded service could destroy all the scores on your cards, depending on how it interprets things.

 

But what I think might work well is to define two different types of services that could interoperate:

1) Word Database (e.g. Skritter, Pleco, Anki)

2) Vocabulary Source (e.g. ChinesePod, FluentU, Chairman's Bao)

 

In such a scenario, a user would likely have only one "Word Database" and one or more "Vocabulary Sources".

 

"Vocabulary Sources" could query the "Word Database" to get a list of known/characters words. This would be simple, no need to consider homonyms. They would then use this information to suggest content at your level.

 

"Vocabulary Sources" could also export selected (sets) of vocabulary to a list in your "Word Database".

 

This would also eliminate the need to use the "walled garden" vocabulary review tools that most of the "Vocabulary Source" type tools currently offer.

 

I think that simple scenario would cover what I was hoping to achieve. It wouldn't require a middleman, just a simple API specification. Costs could be covered as part of existing subscriptions to the "Vocabulary Sources" and "Word Databases" (or, in the case of Pleco, a new subscription model for flashcards). Ideally, you would just need to log in to your "Vocabulary Source" service, then specify your "Word Database" credentials to hook them up.
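The two interactions described above (query known words, export a vocab list) can be sketched as two tiny interfaces. The class and method names are my own invention for illustration; a real version would be an HTTP API specification rather than Python classes.

```python
# Hypothetical sketch of the two-role split proposed above.
class WordDatabase:
    """The single store of record (the Skritter/Pleco/Anki role)."""
    def __init__(self):
        self.known = set()   # known words/characters
        self.lists = {}      # named lists imported from Vocabulary Sources

    # Call 1: a Vocabulary Source queries the learner's known words.
    def known_words(self):
        return frozenset(self.known)

    # Call 2: a Vocabulary Source exports a named set of vocabulary.
    def import_list(self, name, words):
        self.lists.setdefault(name, set()).update(words)

class VocabularySource:
    """A content provider (the ChinesePod/FluentU role)."""
    def __init__(self, name, lesson_words):
        self.name = name
        self.lesson_words = lesson_words

    def suggest_new(self, db):
        # Suggest only words the learner doesn't already know.
        return sorted(w for w in self.lesson_words if w not in db.known_words())

db = WordDatabase()
db.known.update({"你", "好"})
source = VocabularySource("demo-lessons", ["你", "好", "谢谢"])
new_words = source.suggest_new(db)
db.import_list("demo-lessons", new_words)
print(new_words)
```

Because data only flows from Source to Database through an explicit list import, a badly behaved Source can't corrupt existing review scores, which addresses the sync-risk concern above.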

