Chinese-forums.com

How to convert a BIG5 code into a Chinese character in VB

kenneth540

Hi All,

I am writing a Visual Basic program that pulls data from a web page (it reads the HTML file) which contains some BIG5 codes. I need to be able to convert these BIG5 codes into the corresponding Chinese characters before storing them in a database (MS Access).

Does anyone have any idea how to accomplish this?

For example, my data input would be "&#39080;&#37319;&#20381;&#28982;" and I need to be able to convert it to "風采依然".

TIA,

Kenneth

trevelyan

I'm not sure I understand what you mean. BIG5 *is* an accepted encoding for Chinese characters... do you mean that your database is having trouble with BIG5 and you want to convert the data to another encoding like GB2312 or UTF-8?

If this is what you actually mean, take a look at the GNU program iconv (library == libiconv). It comes with everything necessary to handle conversions between different encodings. You should be able to compile the functionality directly into your software.
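
For reference, a minimal sketch of what calling iconv from C might look like, assuming the input really is Big5-encoded bytes and that the local iconv build knows the "BIG5" and "UTF-8" encoding names. The helper name big5_to_utf8 is made up for this example; iconv_open, iconv and iconv_close are the library's actual entry points.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL-terminated Big5 byte string to UTF-8.
 * Returns the number of bytes written (excluding the NUL), or -1 on error.
 * Hypothetical helper for illustration; not part of libiconv itself. */
long big5_to_utf8(const char *big5, char *out, size_t out_size)
{
    assert(big5 != NULL && out != NULL && out_size > 0);

    /* iconv_open takes the target encoding first, then the source encoding */
    iconv_t cd = iconv_open("UTF-8", "BIG5");
    if (cd == (iconv_t)-1)
        return -1;                       /* this conversion is not available */

    char *in_ptr = (char *)big5;         /* iconv wants a non-const pointer */
    size_t in_left = strlen(big5);
    char *out_ptr = out;
    size_t out_left = out_size - 1;      /* keep one byte for the NUL */

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;                       /* bad input or output buffer too small */

    *out_ptr = '\0';
    return (long)(out_ptr - out);
}
```

On systems where iconv is a separate library rather than part of libc, this would need to be linked with -liconv.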

If this isn't what you mean, it would be helpful if you'd clarify the problem.

kenneth540

Hi trevelyan,

My input data contains the BIG5 code representations, i.e. &#xxxxx; where xxxxx is a 5-digit code. I would like to convert this to the corresponding Chinese character before I write it to the MS Access database.

Thanks,

Kenneth

Jose

Like Trevelyan, I also find it difficult to understand what you mean by "converting the Big5 representation into the corresponding character". A Chinese character when stored electronically is nothing more, and nothing less, than a numeric representation. There are different encoding schemes: Big5, GB2312, UTF-8 and so on, and there are tools to convert between them, as Trevelyan said. But you'll always be using one particular encoding.

I don't know if there's anything I'm missing in your question. Maybe I haven't understood it properly, but I would say that converting a numeric code into a proper Chinese character is part of the visualisation process that a text processor performs to display the glyphs associated with particular letters or characters from their underlying numeric value. If you're handling the characters programmatically, a character and its numeric representation are the same thing as far as I can tell.

kenneth540

Let me try to clarify it further. The attached file is a sample HTML file that I am dealing with. In the file, there are four BIG5 codes, representing four different Chinese symbols. Note that I had to change the file extension from .HTML to .TXT in order to attach it here. When the file is viewed in my web browser (after the extension has been changed back to .HTML), I see the four Chinese symbols instead of the codes themselves.

Now, when the same file is fed into my Visual Basic program, the program treats it as a normal text file. It only sees the BIG5 codes. If I output what I read straight to, let's say, an MS Access database, it stores the BIG5 codes as seen in the .HTML file in the Access table. What I want is to store the Chinese symbols in the Access table instead.

I know Access XP (English version) supports the storage of Chinese symbols, as I can copy a Chinese symbol from my web browser displaying the attached .HTML file and paste it into the Access table.

I understand that Chinese symbols, when stored in their most basic electronic form, are nothing more than numbers (1's & 0's). However, I am not dealing with the most basic level here. I am dealing with the MS Access level.

Hope I've explained myself clearly here and thanks for the input so far.

Kenneth

kenneth540

One more note: if there is a way to capture the rendered HTML output, rather than the actual HTML file, from a web page in Visual Basic, it would solve my problem.

smalltownfart

I have no experience with MS Access or VB, but I have encountered similar problems on other platforms.

Is your MS Access using Unicode? I would check the help to see how to find this out.

Typically, when you are inserting these kinds of strings into a DB, you have to convert them into the encoding used by your DB.

If your DB is Unicode/UTF-XX, there should be some way to convert from Big5 to Unicode; check the API available to your VB program.

Jose

I think I understand what your problem is now. Sorry that I didn't read your first post more carefully. I was thinking that the HTML files you were using were Big5-encoded but, if I understand correctly, they use a Western encoding and store the Chinese characters in ANSI text format as "&#" + (Big5 code) + ";", so your input consists of the Big5 codes AS TEXT.

If I understand things correctly, I think you will have to convert that code (written as text in the HTML) into binary format, so that you get real Big5 text. To do that, you would use a function like atoi in C, like this:

/* Example in C */
#include <stdlib.h>

int main(void)
{
    /* The input: the digits that follow "&#", still as text.
       "39080" is only a sample value. */
    const char *big5_code_as_text = "39080";

    /* Now we can convert it to binary (an integer value) */
    int big5_code = atoi(big5_code_as_text);

    return big5_code == 39080 ? 0 : 1;
}

Now if you were working with a Traditional Chinese version of MS Access, you could write the "big5_code" values into a file and that would appear as Chinese text when viewed under MS Access or whatever.

However, if, as seems to be your case, you're working under a non-Chinese system, the program won't recognise the text as Chinese (it will look something like "AüÍñÖè..."). I'm not sure how MS Office programs treat multibyte text, but I guess if you convert the encoding to UTF-16, the characters will appear correctly.
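
As a rough illustration of the UTF-16 step, here is a sketch that turns a single code point into UTF-16 code units. It assumes the five-digit number is already a Unicode code point (39080 is 0x98A8, which is 風 in Unicode), in which case no Big5 table is needed. encode_utf16 is a hypothetical helper name, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
 * Writes 1 or 2 code units to out[] and returns how many were written,
 * or 0 if cp is not a valid Unicode scalar value. */
size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    /* Reject values beyond Unicode and the surrogate range itself */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {                  /* fits in a single code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                       /* split into a surrogate pair */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

Common Chinese characters such as the four in this thread fit in one code unit each; the surrogate-pair branch only matters for rarer characters above U+FFFF.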

To convert from Big5-coded characters to a UTF encoding, you will need to use a function that does that, like the libiconv library mentioned by Trevelyan. The example in C would continue like this:

int utf_code;

utf_code = WhateverConversionFunction(big5_code, bla, bla, bla);/*Just an example*/

So, we would get utf_code as the output of a conversion function.

Now, I guess you could write those utf_code values into your database. If things don't work, check the Microsoft documentation for MS Office developers to see if there is any useful information about storing Unicode strings in MS Office files. This is all I can think of. Sorry that I cannot be more concrete about conversion functions or MS Access stuff, but I'm not an expert on this.
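
Putting the pieces together, here is one way the whole conversion could be sketched in C. It rests on an assumption that is worth checking against the real data: that the numbers in the HTML are decimal Unicode code points, as in standard HTML numeric character references of the form &#NNNNN;, rather than Big5 codes. Under that assumption the value can be encoded directly (here as UTF-8) with no Big5 lookup. The function names decode_decimal_refs and put_utf8 are invented for this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Write code point cp to dst as UTF-8; returns the number of bytes written.
 * The caller guarantees cp is a valid scalar value. */
static size_t put_utf8(unsigned long cp, char *dst)
{
    if (cp < 0x80) {
        dst[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        dst[0] = (char)(0xC0 | (cp >> 6));
        dst[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        dst[0] = (char)(0xE0 | (cp >> 12));
        dst[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    dst[0] = (char)(0xF0 | (cp >> 18));
    dst[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    dst[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    dst[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Copy src to dst, replacing each decimal reference "&#NNNNN;" with the
 * UTF-8 bytes of that code point. Malformed or out-of-range references are
 * copied unchanged. dst must hold at least strlen(src) + 1 bytes, which is
 * enough because a reference is never shorter than its UTF-8 encoding. */
void decode_decimal_refs(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);

    while (*src != '\0') {
        if (src[0] == '&' && src[1] == '#' && isdigit((unsigned char)src[2])) {
            char *end;
            unsigned long cp = strtoul(src + 2, &end, 10);
            int valid = cp > 0 && cp <= 0x10FFFF &&
                        !(cp >= 0xD800 && cp <= 0xDFFF);
            if (*end == ';' && valid) {
                dst += put_utf8(cp, dst);
                src = end + 1;           /* skip past the ';' */
                continue;
            }
        }
        *dst++ = *src++;                 /* ordinary byte or malformed reference */
    }
    *dst = '\0';
}
```

For example, decoding the string "&#39080;&#37319;&#20381;&#28982;" with this function should give the UTF-8 bytes for 風采依然. Anything that is not a well-formed decimal reference, such as &amp;, is copied through unchanged.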

Hope this helps.

smalltownfart

Whoops, I didn't see your text file earlier.

What you have is an HTML file with encodings for the Big5 chars. It will render correctly in a browser, but the text is not actually in Big5 or Unicode, as the previous poster noted.

If, like me, you have WinXP, what you *should* be able to do is:

- rename the file extension back to HTML
- open it in Internet Explorer/Firefox/whatever browser
- select all the text and copy it to the clipboard
- open Notepad and create a file of type Unicode
- paste the text into Notepad & save it.

I think saving it as ANSI would also work, but you may need to set your "Language for non-Unicode programs" in Regional Options.

You may want to get another Unicode editor: although Notepad supports Unicode in Win2K and WinXP, it is kinda spartan. You will find quite a few free ones out there.

smalltownfart

If you need to do this programmatically, you should check out the BIG5 to Unicode mapping table at:

http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
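
If the data did turn out to contain real Big5 codes, a lookup built from that table would be needed. The sketch below parses a single line of the table, assuming the layout that unicode.org mapping files generally use: a Big5 code in hex, a tab, the Unicode value in hex, then a # comment. That layout and the helper name parse_mapping_line are assumptions to verify against the actual file.

```c
#include <assert.h>
#include <stdio.h>

/* Parse one line of a mapping table, assumed to look like:
 *   0xA4A4<TAB>0x4E2D<TAB># comment
 * Returns 1 and fills *big5 and *unicode on success.
 * Returns 0 for comment lines, blank lines and anything else unparsable. */
int parse_mapping_line(const char *line, unsigned *big5, unsigned *unicode)
{
    assert(line != NULL && big5 != NULL && unicode != NULL);

    if (line[0] == '#')
        return 0;                        /* whole-line comment */

    /* The space in the format matches any run of whitespace, including tabs */
    return sscanf(line, "0x%x 0x%x", big5, unicode) == 2;
}
```

A full converter would read the file line by line with fgets, call this on each line, and store the pairs in a table indexed or searched by Big5 code.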

There is a freeware (GPL'd) utility that does the reverse of what you want (Big5 to character reference codes): Tsai Chih-Hao's B5TOUNI,

http://technology.chtsai.org/b5touni/

Maybe you can also use this as a guide for your coding efforts (it does basically what you want in PHP):

http://annevankesteren.nl/2005/05/character-references

Please keep us posted on progress; I am sure other people could use your utility :)
