westmeadboy Posted April 12, 2010 at 04:30 PM
Bit of a vague title there, but... Suppose I have a traditional character A with simplified form B (not the same).
1. Can I reasonably (say, 90% sure) assume that B is not part of the Traditional character set?
2. Also, can I reasonably assume that A is not part of the Simplified character set?
cababunga Posted April 12, 2010 at 07:21 PM
1. You cannot, because in many cases several traditional characters were replaced by one simplified form during the simplification process. For example, 只 in simplified script can be 只, 隻 or 祇 in traditional.
2. Also no. This time it is because the GB character set was eventually extended to include most of the traditional characters. http://en.wikipedia.org/wiki/GB_18030
westmeadboy Posted April 12, 2010 at 07:28 PM (edited)
Thanks very much. When I talk about the character set, I really mean the set of traditional characters used in Traditional Chinese rather than the computer-style character set, if that makes sense...
A) If I have a traditional character (which has a simplified variant), could it possibly match any characters in a piece of text written in Simplified Chinese?
B) And vice versa?
From your post I assume the answer to both is no.
EDIT: Actually, from your example, it looks like 只 has a traditional variant 隻 but that 只 also appears in Traditional Chinese. That would mean the answer to (A) is no, but to (B) is yes.
Edited April 12, 2010 at 08:37 PM by westmeadboy
cababunga Posted April 12, 2010 at 08:38 PM
In that case I misunderstood you. There is no ambiguity in mapping from traditional to simplified characters. So if traditional character 祇 has simplified form 只, you can be sure that 祇 can't be found in simplified text (unless by mistake).
The "vice versa" case is always possible, but if 90% confidence is really all you want, then consider this: about 2/3 of reasonably frequently used traditional characters have the same appearance in simplified Chinese. Now, out of all simplified characters, 82 have more than one traditional variant (according to the data found in the Unihan database). So your confidence is about 97% if you consider all 2593 simplified characters having at least one traditional variant.
Can you tell me why this is important anyhow?
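The 97% figure in the post above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the two counts are the ones quoted in the post (attributed there to Unihan), and the function name `unambiguous_share` is made up for illustration.

```python
# Counts quoted in the post above (attributed there to the Unihan database).
SIMPLIFIED_WITH_TRADITIONAL_VARIANT = 2593   # simplified chars with >= 1 traditional variant
SIMPLIFIED_WITH_SEVERAL_TRADITIONAL = 82     # of those, chars with > 1 traditional variant


def unambiguous_share(total: int, ambiguous: int) -> float:
    """Fraction of simplified characters that map back to exactly one traditional form."""
    return (total - ambiguous) / total


share = unambiguous_share(
    SIMPLIFIED_WITH_TRADITIONAL_VARIANT, SIMPLIFIED_WITH_SEVERAL_TRADITIONAL
)
print(f"{share:.1%}")  # prints 96.8%
```

Note that this counts characters by type, not by how often they occur in running text; frequent characters such as 只 are among the ambiguous ones.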
westmeadboy Posted April 12, 2010 at 08:42 PM
Those stats are exactly what I wanted, thanks. I'm writing some code to search the CC-CEDICT dictionary. I want to be able to search with either simplified or traditional chars and get the same results/entries, if that makes sense.
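One way to approach the search described above is to index every entry under both of its headwords. The sketch below assumes the usual CC-CEDICT line layout, `Traditional Simplified [pinyin] /gloss/gloss/`. The three sample entries are illustrative stand-ins rather than lines copied from the real file, and `parse_line` and `build_index` are hypothetical helper names, not part of any library.

```python
import re
from collections import defaultdict

# Assumed line layout: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
LINE_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")

# Illustrative sample data, NOT copied from the real dictionary file.
SAMPLE = """\
# comment lines start with a hash
隻 只 [zhi1] /classifier for birds and certain animals/
只 只 [zhi3] /only/merely/
祇 只 [zhi3] /variant of 只/
"""


def parse_line(line):
    """Return a dict for one dictionary line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    traditional, simplified, pinyin, glosses = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        "glosses": glosses.split("/"),
    }


def build_index(lines):
    """Map each headword (traditional or simplified) to the entries containing it."""
    index = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        entry = parse_line(line)
        if entry is None:
            continue
        # A set avoids adding the entry twice when both forms are identical.
        for key in {entry["traditional"], entry["simplified"]}:
            index[key].append(entry)
    return index


index = build_index(SAMPLE.splitlines())
print(len(index["只"]))  # 3: the simplified form collects all three entries
print(len(index["隻"]))  # 1: the traditional form finds only its own entry
```

With this layout a simplified query returns every entry that shares the form, while a traditional query returns only its own. Whether a traditional query should also be expanded through its simplified form to reach the sibling entries is a design choice this sketch leaves open.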
cababunga Posted April 12, 2010 at 09:02 PM
I just discovered something. There are seven traditional characters which have two simplified variants each. These are: 瀋, 畫, 鍾, 靦, 餘, 鯰, 鹼. This is important because until now I was convinced that there were no such cases. Here is the mapping:
瀋 -> 沈 渖
畫 -> 划 画
鍾 -> 钟 锺
靦 -> 腼 䩄
餘 -> 余 馀
鯰 -> 鲇 鲶
鹼 -> 硷 碱
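A list like the one above could be pulled out of Unihan-style variant data roughly as follows. This is a sketch under assumptions: it takes the data to be tab-separated lines of the form `U+XXXX<TAB>kSimplifiedVariant<TAB>U+YYYY U+ZZZZ`, and it builds its sample lines from a few of the pairs mentioned in this thread instead of reading the real file. The helper names are invented for illustration.

```python
from collections import defaultdict


def to_codepoint(char):
    """Format a character as a U+XXXX string."""
    return f"U+{ord(char):04X}"


def from_codepoint(text):
    """Turn a U+XXXX string back into a character."""
    return chr(int(text[2:], 16))


# Sample pairs taken from this thread; real data would be read from a file.
PAIRS = {"瀋": "沈渖", "鍾": "钟锺", "餘": "余馀", "隻": "只"}
sample_lines = [
    f"{to_codepoint(trad)}\tkSimplifiedVariant\t"
    + " ".join(to_codepoint(s) for s in simps)
    for trad, simps in PAIRS.items()
]


def parse_simplified_variants(lines):
    """Collect traditional -> [simplified, ...] from kSimplifiedVariant lines."""
    mapping = defaultdict(list)
    for line in lines:
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] != "kSimplifiedVariant":
            continue
        source, _, targets = parts
        mapping[from_codepoint(source)].extend(
            from_codepoint(t) for t in targets.split()
        )
    return dict(mapping)


mapping = parse_simplified_variants(sample_lines)

# Traditional characters listed with more than one simplified variant.
multi = {trad: simps for trad, simps in mapping.items() if len(simps) > 1}
for trad, simps in multi.items():
    print(trad, "->", " ".join(simps))
```

Storing the targets as a list keeps the one-to-many cases visible instead of silently keeping only the last variant seen.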
westmeadboy Posted April 12, 2010 at 09:04 PM
When I search in YellowBridge, it only shows one variant. Is that because YellowBridge is not geared up to show multiple variants?
cababunga Posted April 12, 2010 at 09:39 PM
I can't say anything about YellowBridge. These cases were extracted from Unihan. You can try any dictionary based on a recent version of CEDICT. Here is an example: http://mandarinspot.com/dict?word=%E7%80%8B&phs=pinyin
westmeadboy Posted April 12, 2010 at 09:43 PM
Thanks for the useful link. Totally unrelated question: any ideas where they get their frequency data? I've been looking for a good source for a while now...
Glenn Posted April 12, 2010 at 09:51 PM
Quote: "畫 -> 划 画"
Is that right? I thought that 画 was from 畫 and 划 was from 劃.
chrix Posted April 12, 2010 at 09:54 PM
I agree with Glenn. There's also 划 in traditional, mostly in the meaning "to row" (huá).
cababunga Posted April 13, 2010 at 12:01 AM
Westmeadboy, at the moment frequencies are drawn from here http://www.dataparksearch.com/, the one that's called Traditional.freq.
Glenn & chrix, you are probably right. Using data from CEDICT, I could only confirm such a split for four of the seven characters I found in Unihan. The other three, 畫, 鯰 and 鹼, map to 画, 鲇 and 碱 respectively.
renzhe Posted April 13, 2010 at 07:59 PM
Quote: "There is no ambiguity in mapping from traditional to simplified characters. So if traditional character 祇 has simplified form 只, you can be sure that 祇 can't be found in simplified text (unless by mistake)."
This is also incorrect, although very many people seem to believe it. Many people also choose to use traditional characters on the computer because they think that they won't lose information if they convert to simplified character-for-character. Even Wikipedia is wrong on this.
There is a small number of characters where the mapping of traditional characters to simplified is not unique. This was obviously not by design; it's just that the original (traditional) character set was already so rich with variants and meanings that a perfect mapping was not always possible.
For example, the character 於 is simplified into 于 when used as a preposition, but is still written as 於 when used as a surname (N.B. 于 is also a surname, but a different one!).
The character 矇 is almost always simplified in the word 蒙胧, but is written as 矇 when it means "blind". You cannot always simplify it without looking at the word structure, because 蒙 has several meanings, and 矇 doesn't, so you might be introducing ambiguity. On the other hand, 矇胧 and 蒙胧 mean exactly the same thing. As does 濛扠, which is simplified to 蒙扠 and pronounced exactly the same. 朦胧, on the other hand, is pronounced the same, but means something different, and can't be simplified to 蒙胧.
Everything clear? I can't think of any others now, but I remember coming across a few more.
chrix Posted April 13, 2010 at 08:04 PM
Also, one of my pet peeves is the assumption that 祇 and 纔 are commonly used in traditional: they aren't. Most traditional texts use the simplifications 只 and 才, even though originally 祇 and 纔 might have been the way to write them. Just one thing I wanted off my chest (I'm not saying that anyone on this thread made this assumption).
renzhe Posted April 13, 2010 at 08:23 PM
I also find that, despite a de facto standard for both traditional and simplified character sets, the issue is much more complex than "simplified uses these characters here, and traditional uses those". Phonetic loans, shorthands and other types of simplifications have simply always been a part of the Chinese language.
The PRC performed a rather sweeping reform in the late 50s/early 60s, which some people found excessive and wanton, not without reason. But even so, many of the "new" characters were in fact old variants, handwritten variants and common loans. This has always been a part of the Chinese language.
他 was split off into 他 and 她 very recently (Lu Xun still used 他 for women), which is accepted in all regions, but 你 stayed only 你 on the mainland (妳 is the correct address for women in HK and Taiwan). 它 is considered the correct neutral pronoun, deprecating 牠 in HK (not sure about Taiwan). 那 used to be a synonym for 哪, but 哪 was split off in the 20th century. 台 is a common shorthand for 臺 even in Taiwan. We've recently had a thread where an archaic form 翫 of the character 玩 popped up, but 玩 is used in modern traditional characters. The list is endless.
So, regardless of how one feels about the political circumstances surrounding the simplification process in the PRC, it is simply more productive to see all these characters as variants and to use the correct variants in the proper context, because you'll have to deal with this even if you only ever use one of the two standards.
chrix Posted April 13, 2010 at 09:20 PM
Quote: "妳 is the correct address for women in HK and Taiwan"
Not necessarily true, at least not for Taiwan. It varies by person and by situation. Join Facebook and have a look, you'll be surprised. Some people even use 他 to refer to women, but this strikes me as kind of archaic or overly formal usage.
renzhe Posted April 13, 2010 at 09:45 PM
See, it's even more difficult than I imagined!
Quote: "Join Facebook and have a look"
NEVER!
skylee Posted April 13, 2010 at 11:48 PM
Quote: "but 你 stayed only 你 on the mainland (妳 is the correct address for women in HK and Taiwan). 它 is considered the correct neutral pronoun, deprecating 牠 in HK (not sure about Taiwan)."
I think it really depends on one's preference. I don't use 妳, but some people do. I don't use 牠 (but I was taught to use it for animals), and I don't think it is widely used nowadays. I am not sure what is taught in schools nowadays, though.
chrix Posted April 13, 2010 at 11:57 PM
Oh right, there is also 祂, for God, though I'm not sure if it's used consistently by Christians...
skylee Posted April 14, 2010 at 12:02 AM
Quote: "There is no ambiguity in mapping from traditional to simplified characters. So if traditional character 祇 has simplified form 只, you can be sure that 祇 can't be found in simplified text (unless by mistake)."
Not necessarily by mistake. 祇 exists in simplified text, as in 神祇.