Chinese-Forums

Attention Software Developers


trevelyan


We've just released a new version of the Adso software with better support for fellow software developers. Specifically, you can now easily incorporate Chinese-English translation, text analysis, segmentation and hanzi-to-pinyin conversion into your own applications. The word "easy" doesn't begin to describe how painless this process is: you can invoke a machine translation engine in a single line of code.

I'd encourage anyone interested in Chinese software development to check it out. Usage is free for Chinese-English translation, hanzi-to-pinyin conversion and text segmentation - everything you need to build incredible next-gen learning and reference applications. The software is available for download at:

http://adsotrans.com/downloads/

A quick write-up to get you started is at the link below. I'd recommend starting there if you don't have any experience with Adso:

http://adsotrans.com/blog/developer-corner-adso-with-your-own-cc-application/

Feedback and questions are always welcome here, by email or at Popup Chinese. If you aren't a programmer but want to help us in our effort to produce great, free NLP software for students and developers, the best way is to help spread the word about what we're doing and contribute missing content to our ever-expanding linguistic database through the online Popup Chinese dictionary.


  • 4 weeks later...

adso-v5.058.tar.gz doesn't build for me, I get:

~/Download/adso-v5.058/scripts/compile_binaries> ./run
g++ -o database.o -c ghost_database.cpp
In file included from ghost_database.cpp:2:
ghost_database.h:19: warning: ‘typedef’ was ignored in this declaration
g++ -c adso.cpp
In file included from adso.cpp:1:
adso.h:21: warning: ‘typedef’ was ignored in this declaration
In file included from adso.cpp:11:
ghost_database.h:19: warning: ‘typedef’ was ignored in this declaration
adso.cpp: In member function ‘int Adso::UTF8_C_word_lookup(std::string)’:
adso.cpp:269: error: ‘strcmp’ was not declared in this scope
adso.cpp: In member function ‘int Adso::UTF8_S_word_lookup(std::string)’:
adso.cpp:309: error: ‘strcmp’ was not declared in this scope
adso.cpp: In member function ‘int Adso::word_lookup(std::string)’:
adso.cpp:358: error: ‘strcmp’ was not declared in this scope
adso.cpp: In member function ‘std::string Adso::wordstr_lookup(std::string)’:
adso.cpp:372: error: ‘strcmp’ was not declared in this scope
make: *** [adso.o] Error 1


Trevelyan,

I've looked at this a bit further. The code for 5.058 is not compiling at all; did you post the correct source?

For example:

  • lines saying 'typedef struct xyz;' - this does not compile, seems you need to remove the 'typedef'.
  • several libraries are not included when they are used, such as cstring and cstdlib
  • multiple parameters with the same identifier in a method definition, for example in code.h

There are probably more things, but if the code does not compile then probably it needs to be debugged as well.

I am using gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC).
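Two of the problems listed above (the empty typedef and the duplicated parameter names) can be illustrated with a minimal sketch. The Entry type and make_entry function are hypothetical stand-ins for illustration, not code from the Adso sources.

```cpp
#include <cassert>
#include <string>

// GCC 4.3 warns on `typedef struct Entry;` because the typedef names nothing.
// A plain forward declaration expresses the same intent and compiles cleanly.
struct Entry;

// Hypothetical record type, defined later in the same translation unit.
struct Entry {
    std::string chinese;
    std::string pinyin;
};

// Each parameter needs its own identifier. A signature such as
// `Entry make_entry(std::string text, std::string text)` is rejected
// as a redefinition of `text`.
inline Entry make_entry(const std::string& chinese, const std::string& pinyin) {
    assert(!chinese.empty());  // an entry needs a headword
    Entry e;
    e.chinese = chinese;
    e.pinyin = pinyin;
    return e;
}
```

Dropping the typedef keyword leaves a forward declaration that older GCC releases accept as well, so a fix of this kind should not break the 4.1 and 4.2 builds.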

Edited by m.ellison
more info

Hmmm.... the code has compiled on pretty much every system I've tried save one (details below). I don't know if I've tried compiling it with g++ 4.3.2 (our server has 4.1.2 installed), but we've been compiling successfully with GCC since at least 2.95. Adso 5.058 is also running on at least two or three servers with different distributions at this point, so I'd lean towards a systems issue or perhaps a compiler bug, especially since the C++ compiler is complaining about typedef. There's very little we can do if the compiler itself has problems. If there is an issue with the software we can fix it though.

I'll check out the latest version of g++ and see how compilation goes at any rate. The one system known to have problems compiling from scratch has been a 64-bit Red Hat distro. The problem there was a unicode-related bug in one of the system libraries that choked when compiling/linking because of the UTF8 content in the source file polisher.cpp. Deleting all of the UTF8 content in that file solved the problem. It's not an ideal solution. Encodings are a recipe for disaster whatever you do though.

Specific details on your setup are welcome (which version you're trying to compile, mysql/internal/sqlite etc.). If you want to set up VPN access or create a sandboxed account with the software in it, I could always try to SSH in and look at the problem myself. Send me an email if you'd like to try that. I'll follow up on the compiler to see if that's an issue here too.
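On the UTF8 content in polisher.cpp mentioned above: one alternative to deleting the Chinese text is to write the same bytes as hex escapes, so the source file itself stays 7-bit ASCII. Whether this would have avoided that particular Red Hat library bug is untested; the sketch below only shows the technique, and the function names are hypothetical.

```cpp
#include <string>

// U+4E2D (中) encoded as UTF-8 is the three bytes E4 B8 AD. Writing them as
// hex escapes keeps the source file free of non-ASCII bytes.
inline std::string zhong() {
    return std::string("\xE4\xB8\xAD");
}

// 中国: U+4E2D followed by U+56FD (E5 9B BD). Adjacent string literals are
// concatenated by the compiler, which keeps each character's bytes separate
// and stops a hex escape from swallowing a following hex-looking character.
inline std::string zhongguo() {
    return std::string("\xE4\xB8\xAD" "\xE5\x9B\xBD");
}
```

The resulting strings are byte-for-byte identical to what a UTF-8 source file would produce, at the cost of readability.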


I've just confirmed that Adso compiles properly on 4.1.2 when downloaded directly from the server. After looking into your bug report I looked into the compiler versions and noticed that the 4.3.* branch is the development and non-stable branch. 4.2.4 is the latest stable release. Upgraded my Ubuntu distribution to 4.2.1 since that is their latest release and had the software compile without a problem as well.

There may be a way to get the software working under 4.3.2, but unless the problem shows up in a stable GCC release it is almost certainly an issue with the GNU tools rather than us (we're pretty clean C++, although some of the inheritance is a bit tricky). If you figure out a way to work around the compiler problem while keeping the code clean I'd be happy to apply the patch. In the meantime, I'd suggest grabbing the latest stable release if you need to compile.


Trevelyan, the GCC page (http://gcc.gnu.org/) tells me that the current release is 4.3.3 and the development branch is 4.4.0.

I am using the latest version of Fedora namely Fedora 10 updated using yum to the current patch level. I know that Fedora sometimes releases beta versions of software as production but they seem not to have done so this time; Fedora's gcc and g++ are at 4.3.2.

The GCC pages report significant changes between 4.2 and 4.3, as set out in http://gcc.gnu.org/gcc-4.3/porting_to.html. Adso has been broken by the library changes (see "Header dependency cleanup" on that page) and also by some new error messages.

I am not sure if I have time, but I'll try to help with patches if I can.
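The header cleanup described above means that functions such as strcmp are no longer visible unless their own header is included. A minimal sketch of the fix, with hypothetical helper names standing in for the Adso lookup functions:

```cpp
#include <cassert>
#include <cstdlib>  // std::atoi; no longer pulled in indirectly by other headers
#include <cstring>  // std::strcmp; fixes "'strcmp' was not declared in this scope"
#include <string>

// Hypothetical lookup helper: true when a stored key matches the query.
inline bool same_word(const char* stored, const std::string& query) {
    assert(stored != 0);
    return std::strcmp(stored, query.c_str()) == 0;
}

// Hypothetical helper: parse a numeric field such as a frequency count.
inline int parse_count(const std::string& field) {
    return std::atoi(field.c_str());
}
```

Adding the includes is harmless on GCC 4.1 and 4.2, where the same declarations happened to arrive through other standard headers.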


Compilation issues with GCC 4.3.3 are now fixed. Thanks m.e. I've also updated the database with the latest data.

http://popupchinese.com/downloads

One additional feature worth noting is the new --tone-sandhi option. It is still under development (runs of three or more tones are not changed), but it already handles the basics and may be suitable for text-to-speech purposes.
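For readers unfamiliar with the rule, here is a rough sketch of third-tone sandhi on numbered pinyin: when exactly two third-tone syllables are adjacent, the first is pronounced with the second tone. This is not Adso's implementation. It assumes the input is already split into syllables ending in a tone digit, ignores word boundaries and other sandhi rules, and leaves runs of three or more untouched, as the post describes. The function names are made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True when a numbered-pinyin syllable such as "ni3" carries the third tone.
inline bool is_third_tone(const std::string& syllable) {
    return !syllable.empty() && syllable[syllable.size() - 1] == '3';
}

// Rewrite the first syllable of every run of exactly two third-tone
// syllables with tone 2. Longer runs are skipped without changes.
inline std::vector<std::string> apply_third_tone_sandhi(std::vector<std::string> syllables) {
    std::size_t i = 0;
    while (i < syllables.size()) {
        if (!is_third_tone(syllables[i])) {
            ++i;
            continue;
        }
        // Find the end of the current run of third-tone syllables.
        std::size_t end = i;
        while (end < syllables.size() && is_third_tone(syllables[end])) {
            ++end;
        }
        if (end - i == 2) {
            syllables[i][syllables[i].size() - 1] = '2';
        }
        i = end;
    }
    return syllables;
}
```

With this sketch, the pair ni3 hao3 becomes ni2 hao3, while wo3 hen3 hao3 is returned unchanged.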


Trevelyan, now some questions about the AdsoInterface class. I notice the member functions (e.g. pinyinize) take and return std::string objects. Are these UTF8 encoded, or something else? Strictly speaking std::string is only for ASCII (single byte) data. The translations and pinyin forms could be ASCII only, or are they in UTF8?

I've mostly worked this out now; the strings are UTF8-encoded.
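A note on the std::string question: std::string is a sequence of char values with no notion of encoding, so it can carry UTF-8 bytes without trouble, but size() and indexing work on bytes rather than characters. The sketch below shows one common way to count code points in such a string; utf8_length is a hypothetical helper, not part of the Adso API, and it assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
inline std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII: starts a new code point
        }
    }
    return count;
}
```

For a two-character string such as 中国, size() reports six bytes while utf8_length reports two code points, which matters when slicing or measuring the strings the interface returns.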

Edited by m.ellison
more info

  • 8 months later...

Hi trevelyan

First off, thanks for creating such a great product! I've only just started to use it, and it looks really awesome.

Here's my feedback so far:

my@you:~/dev/ruby/dict/freq/adso/source$ ./adso --help
...
 -h, --help                        print this reference 
...

yet

sea@cal:~/dev/ruby/dict/freq/adso/source$ ./adso -h
Welcome to Adso. Enter Chinese text and it will be processed as per your command-line options. If you are unsure of what to do, type "quit" and then type "./adso --help" at the command prompt for instructions on using the software
>>

Further

sea@cal:~/dev/ruby/dict/freq/adso/source$ ./adso -i 我很喜欢吃中国菜  
sea@cal:~/dev/ruby/dict/freq/adso/source$ 
sea@cal:~/dev/ruby/dict/freq/adso/source$ ./adso -i 我很喜欢吃中国菜  -t
I very to like to eat Chinese food 

IMHO, it would make sense to specify -t as default when none of -t,-y,-cn are specified. It is, after all, rather unlikely that you would want _no_ output at all.
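The defaulting behaviour suggested here could look roughly like the following. This is only a sketch of the idea, not Adso's option parser. The thread does not say what -y and -cn do, so the fields are simply named after the flags, and the struct and function names are invented.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Which output flags are in effect after parsing.
struct OutputFlags {
    bool flag_t;   // -t (produces the English translation in the session above)
    bool flag_y;   // -y (meaning not shown in the thread)
    bool flag_cn;  // -cn (meaning not shown in the thread)
};

// Scan the arguments for output flags; if none was given, fall back to -t
// so that the program always prints something.
inline OutputFlags resolve_output_flags(const std::vector<std::string>& args) {
    OutputFlags flags = {false, false, false};
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (args[i] == "-t") {
            flags.flag_t = true;
        } else if (args[i] == "-y") {
            flags.flag_y = true;
        } else if (args[i] == "-cn") {
            flags.flag_cn = true;
        }
    }
    if (!flags.flag_t && !flags.flag_y && !flags.flag_cn) {
        flags.flag_t = true;  // proposed default
    }
    return flags;
}
```

An explicit -y or -cn still suppresses the default, so existing invocations would behave as before.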

Lastly

sea@cal:~/dev/ruby/dict/freq/adso/source$ ./adso -i 干 -t
to fuck 

Is that really the most likely meaning of 干?

In case you need this info

>>sea@cal:~/dev/ruby/dict/freq/adso/source$ !:0 --version
./adso --version

Adso Chinese Text-Analysis System v5.068: (c) David Lancashire, 2009 

Chinese translation and text analysis engine. Inquiries welcome: david.lancashire@gmail.com

Thanks again, Sir Lancashire!

Regards

mke


  • 2 weeks later...

Thanks Mike. I'll keep some of these in mind.

Not sure what the default for 干 should be without much context. Disambiguation is tricky but specific suggestions are always welcome. "to do" may be better.

Edited by trevelyan

  • 2 months later...

Happy Chinese New Year, everyone. The latest version of Adso is up with a more expansive dictionary and some underlying improvements to the engine as well. Specific updates:

February 14, 2010

- updated internal compile version to add support for traditional character input in UTF8. Not previously supported.

February 1, 2010

- eliminated infinite loop in edge-case script detection, particularly for traditional characters missing from the database.

January 20,

- rejigged database production scripts to shorten time-to-generate and speed up the pace of database/dictionary development. Should see more frequent releases at this point.

- subsequent fixes to minor problems raised by the revision in database format, particularly with regard to numbers, etc.

January 3,

- added -n flag for better integration with other Unix applications.

- fixed bug that would result in traditional entries added to an initial word being reduplicated in certain circumstances.

October 1,

- better auto recognition of verb+complement phrases with 得 and adjective/adverb complements.

September 3,

- added --deconstruct-phrases command-line option. This parses text using phrase-level data, but then breaks down the phrases into their constituent parts using the translation information available to identify the best part of speech of sub-units. This is different from the --no-phrases option, which ignores phrase-level data in the database.

Enjoy.


Hey Martin,

The bulk of the data is the database being stored in various formats (mysql, sqlite, and the internally-compiled version). What we should do is keep the latest.

We don't have a Git repository yet, but I can look at creating one. It should also be easy to let people browse the latest distribution and download specific files rather than the whole bundle.

Best,

--david


