Adso instruction(s)

February 13, 2008 at 04:26 PM

Is there a unified set of instructions for compiling Adso? The two READMEs seem slightly out of sync, and the README.txt in the root directory doesn't say where the scripts it tells you to run are located.

README.txt:

COMPILATION:

[...]

./prepare_internal

or

./prepare_mysql

These scripts will prepare the source code for compilation. Once done simply type "make". This will create the binary "adso".

--------------------

scripts/README.txt:

The scripts in this directory create the necessary files for the compilation of the external MySQL database into the actual Adso source code. Just type:

./run

Then go into the source directory and type:

./prepare_internal

Once that is done, type:

make

-------------------

I tried building an "internal" version last night but it didn't produce the expected output. I'll try a MySQL version later. Thanks!

February 14, 2008 at 10:46 AM

To install the internal version:

(1) go to the source directory

(2) type "./prepare_internal"

(3) type "make"

If you're doing this and getting an error message, please send me details of what system you are using and the exact error message it is giving you.

The database files necessary to compile the internal version already exist in the "database" directory by default (you may have deleted them accidentally if you ran the scripts in the "scripts" directory). You don't need to touch anything in the scripts directory unless you are already running the MySQL version and want to compile a stand-alone version of the software using your (edited) version of the dictionary/database instead of the default version that comes with the distribution.
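The three steps above can be wrapped in a small shell function. This is only a sketch: it assumes the distribution unpacks to a directory that contains a "source" subdirectory (as in adso-v5.022/source), and the function name, its argument, and the ADSO_ROOT variable are made up for illustration.

```shell
# Sketch: build the internal (stand-alone) version of Adso.
# build_adso_internal and ADSO_ROOT are hypothetical names, not part of Adso.
build_adso_internal() {
    # Directory the distribution was unpacked into (argument, ADSO_ROOT, or ".").
    root="${1:-${ADSO_ROOT:-.}}"
    if [ ! -x "$root/source/prepare_internal" ]; then
        echo "prepare_internal not found (or not executable) under $root/source" >&2
        return 1
    fi
    # Subshell keeps the caller's working directory unchanged.
    # make only runs if prepare_internal succeeds.
    ( cd "$root/source" && ./prepare_internal && make )
}
```

Called as build_adso_internal ~/chinese_dictionary/adso-v5.022, it should leave the "adso" binary in the source directory if both steps succeed. A non-zero return value means one of the steps failed.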

February 14, 2008 at 12:03 PM

I thought I did that, but the results seem to indicate something is wrong:

Unit:NonChinese:Punctuation:Terminal:Newline

　　 Unit:Punctuation

陸陸陸 Unit:NonChinese

小小 Xiao Unit:Noun:Name

風風風 Unit:NonChinese

...

I used a UTF-8 file and tried both databases, but the results seem to be the same. What is the format of the output, BTW?

The Unit:Punctuation above were spaces, but the output has them as control characters. I removed the control characters so this post doesn't cause problems.

Is Adso lacking a lot of traditional characters or is there some more basic problem causing it to fail to recognize characters such as feng1?

My system is Ubuntu 7.10. Is there a certain set of packages (besides the compiler and mysql) you expect to be installed?

February 14, 2008 at 06:58 PM

Traditional support is worse (we don't have many contributors from Taiwan), but I'd guess the problem if you're annotating really short passages is that the system doesn't have enough data to guess that it is traditional Chinese.

Try specifying the input encoding and script explicitly, as with:

./adso -f [input file] -ie utf8 -is traditional

If that doesn't solve your problem, please send me the text you're trying to annotate and I'll see what the problem is. We're working with a relatively new version of the software, and the code for parsing traditional characters hasn't received a lot of testing, so it's possible there's a problem somewhere and you're the unfortunate guinea pig who's running into it.

Good news is, if there *is* a problem with the software, I should be able to fix it quite quickly.

February 14, 2008 at 07:01 PM

You can specify output encoding and script as well:

-oe utf8 -os simplified

or whatever. For help with the command line instructions, just type:

./adso --help

February 15, 2008 at 01:23 AM

Thanks, that made a big difference and produced useable output. Wenlin still doesn't recognize the output file as UTF-8, but perhaps that's a Windows/Linux problem, or a Wine problem.
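One way to narrow down the Wenlin question is to check, independently of any editor, whether the output file really decodes as UTF-8. The sketch below uses iconv for that. Both helper names are hypothetical. The second helper prepends a UTF-8 byte order mark, which some Windows programs use to detect the encoding; whether Wenlin is one of them is an assumption this thread does not confirm.

```shell
# is_utf8 FILE: succeeds if FILE decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# add_bom INPUT OUTPUT: copy INPUT to OUTPUT with a UTF-8 byte order mark
# (bytes EF BB BF, written here in octal) in front.
add_bom() {
    { printf '\357\273\277'; cat "$1"; } > "$2"
}
```

If is_utf8 fails on the output file, the problem is in the file itself rather than in Wine or Wenlin.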

Is there a way to get traditional and pinyin output instead of simplified and traditional? How do the "-cn -y -t" options work? Adding them all didn't seem to change anything in the output.

Some things I noticed in the output:

月圆月圓 full moon Unit:Noun

...

圆圓 to justify Unit:Verb

月月 moon Unit:Noun

and

他他他 Unit:NonChinese

and

传奇性傳奇性 traditional story's sex Unit:Noun
(Wenlin says "legendary")

Thanks again!

February 15, 2008 at 02:12 AM

> Is there a way to get traditional and pinyin output instead of simplified and traditional?

> How do the "-cn -y -t" options work? Adding them all didn't seem to change anything in the output.

-cn produces Chinese (gb2312 by default)

-y produces pinyin

-t produces English output

If the program defaults aren't good enough for your needs, you can customize output by telling the engine exactly what form of output you want. The following command should be fairly self-explanatory ("" tells the engine to pick the most likely entry for each word before continuing):

./adso -f [input file] --code --extra-code " AND "

This will produce output that looks like this, one word per line:

gb2312 / simplified utf8 / traditional utf8 / english / pinyin / class

....
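For the traditional-plus-pinyin case asked about above, the six-field line format can be cut down with awk. This is a sketch under assumptions: it takes the separator to be exactly " / " and the field order to be the one listed above, and trad_pinyin is a made-up helper name. LC_ALL=C makes awk work on raw bytes, since a line mixes GB2312 and UTF-8 text.

```shell
# trad_pinyin [FILE...]: print the traditional UTF-8 field and the pinyin
# field, separated by a tab. Reads standard input when no file is given.
# Assumes six fields separated by " / ":
#   gb2312 / simplified utf8 / traditional utf8 / english / pinyin / class
# Caveat: an English gloss that itself contains " / " would shift the fields.
trad_pinyin() {
    LC_ALL=C awk -F' / ' 'NF >= 6 { print $3 "\t" $5 }' "$@"
}
```

Lines with fewer than six fields are skipped rather than guessed at, so a short or blank line does not produce a misleading pair.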

If you need to do anything more complex, take a look at the files in the grammar directory. The simplest (giza++.grammar) is designed to preprocess texts for Franz Och's GIZA++ statistical machine translation program.

> 传奇性傳奇性 traditional story's sex Unit:Noun
(Wenlin says "legendary")

传奇性 wasn't in the backend dictionary: the nonsensical definition is the giveaway. What is happening is that Adso by default looks for compound phrases and is more aggressive about putting them together in this version. Most of the time you'll see Adso do this when it has two words which are defined ONLY as nouns.

This design decision is an attempt to increase the visibility of grammatical/semantic misclassifications in the backend dictionary. We've never had access to commercial dictionary data, so we are building a better alternative. On the upside, this means we can afford to share the data freely.

I've just added 传奇性 as "legendary". If you spot a missing word, or a bad entry, the editing interface is at the address below. It's also possible now to annotate the text through the main site and use the point-and-click editing functions (click to edit words, highlight to add new phrases):

http://adsotrans.com/uniedit.php

I'll need to change the editing interface to enable the editing of traditional characters as well.

February 15, 2008 at 03:20 AM

> ./adso -f [input file] --code --extra-code " AND "

I'm afraid this produces an empty file when I run it. I tried different things, but IIRC adding --vocab before --code results in the vocab output format.

The vocab format is useful enough, though if you figure out that the behavior I'm seeing is a bug and fix it, that would be great.

> On the upside, this means we can afford to share the data freely.

Except for "legendary" from Wenlin's ABC.

I should be able to massage the vocab output into something closer to what I need.

Thanks again!

February 15, 2008 at 10:02 AM

I'll take a look with the text you forwarded later this weekend. Thx.

Suggestions for better translations than "legendary" would be welcome.

February 17, 2008 at 11:34 AM

Still haven't fixed the issue with 他, but the others are fixed in the software now. Details:

http://www.chinese-forums.com/showthread.php?p=141265#post141265

We don't have many vocal users who are working on traditional. Let me know what problems you run into.

February 22, 2008 at 01:40 AM

@character,

The command copied above works for me. The only difference is that "input file" needs to be renamed to the name of your input file. Also, it takes a few minutes to process because the file is quite large and you're manipulating it with external code.

February 22, 2008 at 02:13 AM

steve@wearable:~/chinese_dictionary/adso-v5.022/source$ ./adso -f lxf_prologue.u8 --code --extra-code " AND " > lxf_adso_output.u8

steve@wearable:~/chinese_dictionary/adso-v5.022/source$ ls -al *.u8

-rw-r--r-- 1 steve steve 0 2008-02-21 20:48 lxf_adso_output.u8

-rw-r--r-- 1 steve steve 22809 2008-02-21 20:47 lxf_prologue.u8

---------------

Is there some debug mode I can run for you?

"The command copied above works for me." As a developer myself, "it works on my machine" isn't proof of a program's proper functioning.

I'm perfectly willing to believe there's something present/missing in my system's environment which is causing the problem. But without knowing what needs to be present for Adso to work correctly, what versions, what environment settings, etc, it's hard for me to figure out what is wrong.

February 23, 2008 at 08:30 AM

Strange. I've never run into this problem or heard of anyone else having it.... Can you let me know what OS you're running the program on, along with (ideally) your version of g++?

One suggestion would be to try a smaller file (one or two lines) rather than throwing more than 200 at the software at a time. At the least, it will speed up testing (e.g. head -n 2 [input file] > [output file]). It could be that a particular line is causing problems on your machine; identifying the line in question would be useful. Also, if the file was created on Windows, can you try running dos2unix on it before running it through Adso? It would be useful to confirm or eliminate the possibility that the problems are related to Windows file formatting.
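Both suggestions can be combined in a couple of helper functions. This is a sketch with made-up names. It uses tr instead of dos2unix, on the assumption that dos2unix may not be installed; note that tr deletes every carriage return in the file, not only those at line ends.

```shell
# make_sample INPUT OUTPUT [LINES]: copy the first LINES lines of INPUT
# (default 2) to OUTPUT, deleting carriage returns along the way.
make_sample() {
    head -n "${3:-2}" "$1" | tr -d '\r' > "$2"
}

# has_cr FILE: succeeds if FILE contains at least one carriage return,
# which usually means the file has Windows (CRLF) line endings.
has_cr() {
    [ $(tr -cd '\r' < "$1" | wc -c) -gt 0 ]
}
```

Running has_cr on the original input first shows whether Windows line endings are in play at all. Growing the sample size step by step (2, 20, 200 lines) is one way to find the line that triggers the problem.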

I'm working on some ways to speed up processing for those who are using Adso primarily for segmentation. With the exception of the above suggestions, I'm not sure how exactly to deal with this problem in the meantime.

February 23, 2008 at 08:44 AM

Also, output redirection requires write permissions on the directory. Can you confirm that you have those write permissions and/or try the command using "sudo" (root will definitely have write permission)?

I've had this bite me before in other situations.

February 23, 2008 at 10:36 AM

I created the adso directories and have write permission in them. Using sudo didn't change the (lack of) results, nor did using a two-line file instead.

Ubuntu 7.10 32-bit

g++ (GCC) 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)

2GB RAM, plenty of free disk space

-----

Again,

./adso -f ../lxf_prologue.u8 -ie utf8 -is traditional -oe utf8 -os traditional --vocab > lxf_adso10.u8

produces output, while the command using "--code..." does not. Is there some way the codepath used by "--code..." can fail silently? Adso seems to churn for a while, then exit, having produced no output. Is it just exiting if some problem occurs (some variable is null, some error from a method call, some exception, etc.) instead of printing an error message?
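One way to tell a silent crash from a run that simply printed nothing is to look at the exit status and at standard error, neither of which the ">" redirection captures. The helper below is a sketch with a made-up name; it is not part of Adso. On Linux, the shell reports 128 plus the signal number for a process killed by a signal, so a status of 139 would point to a segmentation fault.

```shell
# run_checked OUTFILE CMD [ARGS...]: run CMD with stdout sent to OUTFILE and
# stderr sent to OUTFILE.err, then report what happened.
run_checked() {
    out="$1"; shift
    "$@" > "$out" 2> "$out.err"
    status=$?
    echo "exit status: $status"
    # -s is true only for files that exist and are not empty.
    if [ ! -s "$out" ]; then echo "stdout: empty"; fi
    if [ -s "$out.err" ]; then echo "stderr: see $out.err"; fi
    return "$status"
}
```

For the failing case in this thread the call would be run_checked lxf_adso_output.u8 ./adso -f lxf_prologue.u8 --code --extra-code " AND ". The reported status and the contents of the .err file are the details worth posting.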

February 25, 2008 at 06:59 AM

The strange thing is that the same command works for me. This suggests that it isn't an issue with the "--code" flag itself, although there might be library issues.

I know that I've been using Ubuntu 7.04. Think I'll have to install Ubuntu 7.10 to find out what is happening. Will stick something in the command line version of the next release that gives you the output you want in the meantime.

February 25, 2008 at 07:24 AM

@character -- version 5.023 is up and should solve your issues. Two new command-line options:

--trad-vocab --> traditional/pinyin vocabulary export option

--tonalize --> converts numeric pinyin to UTF8 tone marks

Problem otherwise might be with the compile options in the Makefile (mtune, fno-reduce, etc.), or the default --static compile flag.

February 25, 2008 at 05:53 PM

Thanks! I'll try to test it in the next couple of days.

If you want to make a debug version (printing out method entry and exit, and possibly key parameters/variable values) I'll be happy to run it and send you the output.

February 26, 2008 at 03:53 AM

this is nice. looks like the makings of an embedded solution for mobiles and blackberries.

February 26, 2008 at 05:54 AM

Should work on anything that supports the GNU C++ compiler. Issue with mobile devices is probably more with the interface: inputting content and displaying the output.

Posts above, in order, are by: character, trevelyan, character, trevelyan, trevelyan, character, trevelyan, character, trevelyan, trevelyan, trevelyan, character, trevelyan, trevelyan, character, trevelyan, trevelyan, character, geek_frappa, trevelyan.