Chinese-Forums

Adso instruction(s)


character

Is there a unified set of instructions for compiling Adso? The two READMEs seem slightly out of sync, and the README.txt in the root directory doesn't say where to find the scripts it tells you to run.

README.txt:

COMPILATION:

[...]

./prepare_internal

or

./prepare_mysql

These scripts will prepare the source code for compilation. Once done simply type "make". This will create the binary "adso".

--------------------

scripts/README.txt:

The scripts in this directory create the necessary files for the

compilation of the external MySQL database into the actual Adso

source code. Just type:

./run

Then go into the source directory and type:

./prepare_internal

Once that is done, type:

make

-------------------

I tried building an "internal" version last night but it didn't produce the expected output. I'll try a MySQL version later. Thanks!


To install the internal version:

(1) go to the source directory

(2) type "./prepare_internal"

(3) type "make"
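The three steps above can be wrapped in a small shell function. This is only a sketch: `build_adso_internal` is a hypothetical helper name, and the only things taken from this thread are the source directory, the `./prepare_internal` script, and `make`.

```shell
# Hypothetical helper: build the internal (stand-alone) version of Adso.
# Usage: build_adso_internal /path/to/adso/source
build_adso_internal() {
    src_dir=$1
    # Fail early with a clear message if the preparation script is missing.
    if [ ! -x "$src_dir/prepare_internal" ]; then
        echo "prepare_internal not found or not executable in: $src_dir" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    (
        cd "$src_dir" &&
        ./prepare_internal &&
        make
    )
}
```

The function returns non-zero as soon as any step fails, so the step that broke is the last one that printed anything.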

If you're doing this and getting an error message please send me details of what system you are using and what error message it is giving you.

The database files necessary to compile the internal version already exist in the "database" directory by default (you may have deleted them accidentally if you ran the scripts in the "scripts" directory). You don't need to touch anything in the scripts directory unless you are already running the MySQL version and want to compile a stand-alone version of the software using your (edited) version of the dictionary/database instead of the default version that comes with the distribution.


I thought I did that, but the results seem to indicate something is wrong:

Unit:NonChinese:Punctuation:Terminal:Newline

    Unit:Punctuation

    Unit:Punctuation

陸 陸 陸 Unit:NonChinese

小 小 Xiao Unit:Noun:Name

風 風 風 Unit:NonChinese

...

I used a UTF-8 file and tried both databases, but the results seem to be the same. What is the format of the output, BTW?

The Unit:Punctuation above were spaces, but the output has them as control characters. I removed the control characters so this post doesn't cause problems.

Is Adso lacking a lot of traditional characters or is there some more basic problem causing it to fail to recognize characters such as feng1?

My system is Ubuntu 7.10. Is there a certain set of packages (besides the compiler and mysql) you expect to be installed?


Traditional support is worse (we don't have many contributors from Taiwan), but I'd guess the problem if you're annotating really short passages is that the system doesn't have enough data to guess that it is traditional Chinese.

Try specifying the input encoding and script explicitly, as with:

./adso -f [input file] -ie utf8 -is traditional

If that doesn't solve your problem, please send me the text you're trying to annotate and I'll see what the problem is. We're working with a relatively new version of the software, and the code for parsing traditional characters hasn't received a lot of testing, so it's possible there's a problem somewhere and you're the unfortunate guinea pig who's running into it.

Good news is, if there *is* a problem with the software, I should be able to fix it quite quickly.


Thanks, that made a big difference and produced useable output. Wenlin still doesn't recognize the output file as UTF-8, but perhaps that's a Windows/Linux problem, or a Wine problem.

Is there a way to get traditional and pinyin output instead of simplified and traditional? How do the "-cn -y -t" options work? Adding them all didn't seem to change anything in the output.

Some things I noticed in the output:

月圆 月圓 full moon Unit:Noun

...

圆 圓 to justify Unit:Verb

月 月 moon Unit:Noun

and

他 他 他 Unit:NonChinese

and

传奇性 傳奇性 traditional story's sex Unit:Noun
(Wenlin says "legendary")

Thanks again!


> Is there a way to get traditional and pinyin output instead of simplified and traditional?

> How do the "-cn -y -t" options work? Adding them all didn't seem to change anything

> in the output.

-cn produces Chinese (gb2312 by default)

-y produces pinyin

-t produces English output

If the program defaults aren't good enough for your needs, you can customize output by telling the engine exactly what form of output you want. The following command should be fairly self-explanatory ("" tells the engine to pick the most likely entry for each word before continuing):

./adso -f [input file] --code --extra-code " AND "

This will produce output that looks like this, one word per line:

gb2312 / simplified utf8 / traditional utf8 / english / pinyin / class

gb2312 / simplified utf8 / traditional utf8 / english / pinyin / class

gb2312 / simplified utf8 / traditional utf8 / english / pinyin / class

....
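If only the traditional and pinyin columns are wanted, lines in the six-field layout shown above could be post-processed with awk. This is a sketch under assumptions: it takes ` / ` to be the literal field separator exactly as in the schematic, it assumes no field contains that separator itself, and `trad_pinyin` is a hypothetical helper name, not part of Adso.

```shell
# Hypothetical helper: keep only the traditional (field 3) and pinyin (field 5)
# columns from "gb2312 / simplified / traditional / english / pinyin / class" lines.
# Assumption: " / " is the literal field separator and never occurs inside a field.
trad_pinyin() {
    # LC_ALL=C makes awk split on bytes, so a line mixing GB2312 and UTF-8
    # passes through untouched instead of being decoded.
    LC_ALL=C awk -F ' / ' 'NF >= 5 { print $3 "\t" $5 }' "$@"
}
```

It reads from the files given as arguments, or from standard input when none are given, and prints one tab-separated pair per line.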

If you need to do anything more complex, take a look at the files in the grammar directory. The simplest (giza++.grammar) is designed to preprocess texts for Franz Och's GIZA++ statistical machine translation program.

> 传奇性 傳奇性 traditional story's sex Unit:Noun
(Wenlin says "legendary")

传奇性 wasn't in the backend dictionary: the nonsensical definition is the giveaway. What is happening is that Adso by default looks for compound phrases and is more aggressive about putting them together in this version. Most of the time you'll see Adso do this when it has two words which are defined ONLY as nouns.

This design decision is an attempt to increase the visibility of grammatical/semantic misclassifications in the backend dictionary. We've never had access to commercial dictionary data, so we are building a better alternative. On the upside, this means we can afford to share the data freely.

I've just added 传奇性 as "legendary". If you spot a missing word, or a bad entry, the editing interface is at the address below. It's also now possible to annotate the text through the main site and use the point-and-click editing functions (click to edit words, highlight to add new phrases):

http://adsotrans.com/uniedit.php

I'll need to change the editing interface to enable the editing of traditional characters as well.


./adso -f [input file] --code --extra-code " AND "

I'm afraid this produces an empty file when I run it. I tried different things, but IIRC adding --vocab before --code results in the vocab output format.

The vocab format is useful enough, though if you figure out that the behavior I'm seeing is a bug and fix it, that would be great.

> On the upside, this means we can afford to share the data freely.
Except for "legendary" from Wenlin's ABC. :wink:

I should be able to massage the vocab output into something closer to what I need.

Thanks again!


steve@wearable:~/chinese_dictionary/adso-v5.022/source$ ./adso -f lxf_prologue.u8 --code --extra-code " AND " > lxf_adso_output.u8

steve@wearable:~/chinese_dictionary/adso-v5.022/source$ ls -al *.u8

-rw-r--r-- 1 steve steve 0 2008-02-21 20:48 lxf_adso_output.u8

-rw-r--r-- 1 steve steve 22809 2008-02-21 20:47 lxf_prologue.u8

---------------

Is there some debug mode I can run for you?

"The command copied above works for me." As a developer myself, "it works on my machine" isn't proof of a program's proper functioning. :wink:

I'm perfectly willing to believe there's something present/missing in my system's environment which is causing the problem. But without knowing what needs to be present for Adso to work correctly, what versions, what environment settings, etc, it's hard for me to figure out what is wrong.


Strange. I've never run into this problem or heard of anyone else having it. Can you let me know what OS you're running the program on, along with (ideally) your version of g++?

One suggestion would be to try a smaller file (one or two lines) rather than throwing more than 200 lines at the software at a time. At the least, it will speed up testing (e.g. head -n 2 [input file] > [output file]). It could be that a particular line is causing problems on your machine; identifying the line in question would be useful. Also, if the file was created on Windows, can you try running dos2unix on it before running it through Adso? It would be useful to confirm or eliminate the possibility that the problems are related to Windows file formatting.
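One way to hunt for a problem line is to feed the file to the program one line at a time and note which lines produce nothing. The sketch below is generic and makes no assumptions about Adso's internals; `find_bad_lines` is a hypothetical helper name, and the command to test is whatever you pass in.

```shell
# Hypothetical helper: run a command on each line of a file separately and
# report the lines for which it fails or prints nothing.
# Usage: find_bad_lines input_file command [args...]
# The command receives the path of a one-line temporary file as its last argument.
find_bad_lines() {
    input=$1
    shift
    tmp=$(mktemp) || return 1
    n=0
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        # Skip blank lines: empty output for them is expected.
        [ -n "$line" ] || continue
        printf '%s\n' "$line" > "$tmp"
        # </dev/null stops the command from eating the rest of the input file.
        if ! out=$("$@" "$tmp" 2>/dev/null </dev/null) || [ -z "$out" ]; then
            echo "line $n: no output or non-zero exit"
        fi
    done < "$input"
    rm -f "$tmp"
}
```

For example, `find_bad_lines lxf_prologue.u8 ./adso -ie utf8 -is traditional -f` would pass each one-line file as the argument to `-f`. Every call pays the program's full startup cost, so it is worth trying on a short excerpt first.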

I'm working on some ways to speed up processing for those who are using Adso primarily for segmentation. Apart from the above suggestions, I'm not sure exactly how to deal with this problem in the meantime.


I created the adso directories and have write permission in them. Using sudo didn't change the (lack of) results, nor did using a two-line file instead.

Ubuntu 7.10 32-bit

g++ (GCC) 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)

2GB RAM, plenty of free disk space

-----

Again,

./adso -f ../lxf_prologue.u8 -ie utf8 -is traditional -oe utf8 -os traditional --vocab > lxf_adso10.u8

produces output, while the command using "--code..." does not. Is there some way the codepath used by "--code..." can fail silently? Adso seems to churn for a while, then exit, having produced no output. Is it just exiting if some problem occurs (some variable is null, some error from a method call, some exception, etc.) instead of printing an error message?
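Whether a program is failing silently can be checked from the shell without any debug mode, by capturing its exit status and standard error separately from its output. The following is a generic sketch; `run_and_report` is a hypothetical helper name and nothing in it is specific to Adso.

```shell
# Hypothetical helper: run a command, keeping stdout, stderr and the exit
# status separate so a silent failure becomes visible.
# Usage: run_and_report output_file command [args...]
run_and_report() {
    out_file=$1
    shift
    err_file=$(mktemp) || return 1
    status=0
    "$@" > "$out_file" 2> "$err_file" || status=$?
    echo "exit status: $status"
    echo "stdout bytes: $(wc -c < "$out_file" | tr -d ' ')"
    echo "stderr bytes: $(wc -c < "$err_file" | tr -d ' ')"
    # Show whatever the program wrote to stderr, if anything.
    cat "$err_file"
    rm -f "$err_file"
    return "$status"
}
```

A non-zero exit status with empty stdout and stderr would confirm that the program gave up without reporting why. In most shells a status above 128 means the process was killed by a signal (139 corresponds to a segmentation fault).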


The strange thing is that the same command works for me. This suggests that it isn't an issue with the "--code" flag itself, although there might be library issues.

I know that I've been using Ubuntu 7.04. Think I'll have to install Ubuntu 7.10 to find out what is happening. Will stick something in the command line version of the next release that gives you the output you want in the meantime.


@character -- version 5.023 is up and should solve your issues. Two new command-line options:

--trad-vocab == traditional/pinyin vocabulary export option

--tonalize --> converts numeric pinyin to UTF8 tone marks

Otherwise, the problem might be with the compile options in the Makefile (mtune, fno-reduce, etc.), or the default --static compile flag.

