Chinese-forums.com
Learn Chinese in China

best way to represent Chinese characters online

4fingers

Hi,

I hope you don't mind if I ask a technical question about the best way to represent Chinese characters on websites.

I see that some websites use Unicode, e.g. &#20844;, while others use the actual Chinese symbol, e.g. 火.

Does anyone know what the difference is between these two methods, or have any tips on making sure the visitor is able to view the characters?

Thanks


renzhe

IMHO, it's best to use unicode-encoded characters and state that in the document header so the browsers know what to use.

I don't know what you mean by "actual Chinese symbol e.g. 火". Each (non-ASCII) character is encoded in some way. The 火 you posted is actually a Unicode character; your browser simply interprets it correctly and displays the character.
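
To illustrate the point that every character is encoded in some way, here is a minimal Python sketch (the variable names are made up for illustration). It prints the Unicode code point of 火 and the bytes it turns into under UTF-8, GB2312 and Big5, using Python's built-in codecs.

```python
# One character, one Unicode code point, several possible byte encodings.
ch = "火"

# The code point is an abstract number, independent of how it is stored.
code_point = ord(ch)

# The same character becomes a different byte sequence in each encoding.
encoded = {name: ch.encode(name) for name in ("utf-8", "gb2312", "big5")}

print(f"U+{code_point:04X}")
for name, raw in encoded.items():
    print(name, raw.hex())
```

The bytes differ from one encoding to the next, which is why the browser has to be told which one a page uses.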

chrix

Depending on your target audience, you might also wanna go with either the simplified (GB) or traditional (Big5) rather than Unicode, since many Chinese-speaking users don't use Unicode. Though as renzhe suggests, it could just be a case of crappy coding, i.e. those websites where I have to manually change the encoding in my browser might not have been coded properly.

imron

No, go with unicode and make sure you specify that in the header of the webpage. The browser will then be able to correctly detect the encoding and adjust the display accordingly. There is no reason to use GB or Big5 encodings nowadays, and the sooner people stop using them, the better.

chrix

Yes, it would make more sense this way...

So if you code the website correctly, it's not possible that a visitor will see 亂碼 at all? I understand that the 亂碼 I occasionally run into online might be due to wrong coding on the website, like when the website doesn't tell my browser that's set to Unicode that it is using GB or Big5... But don't some people force-set their browsers to a given character set? Wouldn't they see 亂碼 if the website was in Unicode?

imron

In most (all?) browsers, the force-encoding setting only works for the current page/tab. So you could force a Unicode page to display as GB (and get 乱码) after it had loaded, but once you open a new tab/page, the browser will set the encoding based on what is specified in the header of the page, or based on the default language of the OS if no appropriate header is present.

So, your best choice is always to use utf-8, and always set the header. If you see a page with 乱码 it's almost guaranteed that either they are not using utf-8, or they have not set the header, or both.
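
A small Python sketch of how 乱码 comes about, assuming a page that is stored as UTF-8 but read as GBK (the choice of GBK as the wrong guess is only for illustration):

```python
text = "乱码"

# The page is stored and sent as UTF-8 bytes...
raw = text.encode("utf-8")

# ...but is decoded with the wrong encoding, as happens when no header is
# set and the browser guesses, or when the user forces an encoding.
wrong = raw.decode("gbk", errors="replace")

# Decoding with the declared (correct) encoding recovers the original text.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

The bytes never change; only the interpretation does, so declaring the encoding is what prevents the garbling.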

4fingers
"So, your best choice is always to use utf-8, and always set the header."

So would an example of that be this:


Then there is the question of how I should represent the characters, either using escaped Unicode, e.g. &#20844;, or as a direct encoding:

"IMHO, it's best to use unicode-encoded characters"

Although imron seems to suggest that there is no difference between the two, and popular sites like www.google.cn use the symbols themselves.

After some reading, it seems that numeric character references are just used when the characters can't be directly encoded while editing the HTML document:

http://en.wikipedia.org/wiki/Numeric_character_reference

"Numeric character references (NCR) are typically used in order to represent characters that are not directly encodable in a particular document. When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents."

I assume there aren't any bugs about in web browsers that cause an NCR to be displayed as a different character than what it should be.
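
As a rough check of how a numeric character reference maps to a character, here is a minimal Python sketch using only the standard library (`html.unescape` and the `xmlcharrefreplace` error handler). The variable names are illustrative.

```python
import html

# An NCR is just the decimal (or hex) Unicode code point wrapped in &#...;
ncr = "&#20844;"
decoded = html.unescape(ncr)

# Going the other way: escape anything that does not fit in ASCII as an NCR.
escaped = "火".encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(decoded, escaped)
```

Note that &#20844; decodes to 公 (U+516C), while 火 is U+706B and escapes to &#28779;, so the two examples in the original question are different characters.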

renzhe

I think that there is a trend towards unicode for Chinese sites nowadays, anyway, so it's less likely to be a problem in the future. For example, verycd.com and many other sites use utf-8.

In this day and age, one should only use something else if there is a strong reason. Especially if using more than one language.

imron

@4fingers, yes that is the correct way to specify the header.

I think renzhe and I are in agreement; it's just that there is perhaps some confusion on your part due to initially using slightly incorrect terminology. Normally when someone says "use Unicode", as you did in your original post, it doesn't refer to numeric character references, e.g. &#20844;, but rather to a direct encoding of Chinese characters in Unicode (as opposed to GB2312, Big5 etc). So when renzhe said "it's best to use unicode-encoded characters" he was actually agreeing with the second option in your question, using "the actual Chinese symbol e.g. 火". (Side note: Chinese doesn't have symbols, it has characters.)

Anyway, the answer, as you correctly deduced, is to just use plain, directly encoded Unicode characters. There is no need to use the escaped character references. I don't know of any sites that would do this, except those that demonstrate that such a thing is possible.

renzhe

Yes. What I meant is that you should type/cut'n'paste/enter Chinese characters (using a unicode locale, not GB or Big5), and indicate the proper encoding in the header (like the example you posted).

You definitely shouldn't write the whole webpage in numeric codes like &#20844;. Such a webpage would display fine, but it would be almost impossible to edit and update.
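
The advice in this thread (type the characters directly, save as UTF-8, declare the encoding in the header) can be sketched in Python as below. This is only an illustration under assumptions: the page content, the `write_page` helper and the file name are made up, and `<meta charset="utf-8">` is the short HTML5 form of the declaration (older pages commonly use the longer `http-equiv="Content-Type"` form instead).

```python
from pathlib import Path
import tempfile

# Hypothetical page: characters typed directly, encoding declared in <head>.
PAGE = """<!DOCTYPE html>
<html lang="zh">
<head>
<meta charset="utf-8">
<title>火</title>
</head>
<body>
<p>火 公</p>
</body>
</html>
"""


def write_page(path: Path) -> bytes:
    """Save the page as UTF-8 bytes and return what ended up on disk."""
    path.write_bytes(PAGE.encode("utf-8"))
    return path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    data = write_page(Path(tmp) / "index.html")

print(len(data), "bytes written")
```

The declared encoding and the encoding the file is actually saved in have to match; declaring utf-8 while saving as GB or Big5 would produce exactly the garbling discussed earlier.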
