Downloadable dictionary file?

July 3, 2007 at 03:20 AM

Are there any plans to make the dictionary underneath adsotrans widely available in a cedict or similar format? Stuff like that would help projects like Chinese Pera-kun or other client-side, offline solutions which keep a full copy of the dictionary on the client to reduce lookup overhead.

July 3, 2007 at 05:27 AM

I don't really know what you're talking about, but I use StarDict. Down the bottom there is a link to other dictionary files.

July 3, 2007 at 07:54 AM

It already is, although the server is up-and-down. The current downloads directory:

http://www.adsotate.com/downloads/

July 4, 2007 at 03:03 AM

Thanks! An sqlite database, just what I was looking for.

July 6, 2007 at 02:42 PM

wulong, or anybody,

do you know how to get the data only out of the db file in a certain column order...

I'm not familiar with sqlite. The only command I could find to extract the info is .dump, and that dumps a SQL execution file.

If I try to drop it in sql server or access they both freak as some of the characters are not understood.

If I could get a delimited file of only data, that would be nice.

Do you know the syntax for that?

thanks,

July 6, 2007 at 04:24 PM

@woliveri,

There are lots of ways to get data out of a database, and a couple of versions of the database available. Shouldn't be hard to get customized data output. The most important table is probably "expanded_unified". What exactly do you need to do?

July 6, 2007 at 04:34 PM

Hey trevelyan,

I became frustrated earlier today by the limitations of NJStar's database so it sent me on a search to add more entries into this application.

The Adsotate db seems to be one of the largest out there (that I could find), so I was trying to extract the data into a format that NJStar could easily import.

So I found this topic:

How to add public domain CEDICT project as additional dictionary into NJStar?

Here:

http://www.njstar.com/support/NJStar_Word_Processors/Chinese_Word_Processor/

They have the following instructions:

1. Download NJStar's CEDICT dictionary tool "Makdict.zip", then uncompress the files into a new folder;

2. Go to the following web page and select to download "CEDICT in GB (simplified Chinese)": http://www.mandarintools.com/cedict.html

3. Uncompress the file "Cedict.gb" into the above folder, and rename it to "cedic.dic";

4. Run both "e2cdic2.exe" and "c2edic2.exe" to generate dictionary index files - e2cdic.dic and c2edic.dic;

5. Copy the 3 files (cedic.dic / e2cdic.dic / c2edic.dic) into the NJStar Chinese WP folder;

6. Open NJStar Chinese WP and select "Tools", "Options", "Dictionaries" and set the additional dictionary as follows:

But the CEDICT db didn't do what I wanted, so I turned to the Adsotate db.

So, that's what I'm trying to do....

Is it possible?

Thanks in advance.

July 6, 2007 at 04:37 PM

The data isn't really very normalized (in formal database terms), so it'll be hard for you to construct a query that will get you what you need.

You could take the dump file and run a parsing program to get close to what you need (simply remove the INSERT INTO ... crap).
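Stripping the INSERT INTO wrapper could look roughly like this in Ruby. This is only a sketch: it assumes the usual .dump layout of one statement per line with no spaces between values, and the helper name insert_values is made up for illustration.

```ruby
# Hypothetical helper: turn one line of a `.dump` file into its field values.
# Returns nil for lines that are not INSERT statements (CREATE TABLE, COMMIT, ...).
def insert_values(line)
  m = line.match(/\AINSERT INTO\s+\S+\s+VALUES\s*\((.*)\);\s*\z/m)
  return nil unless m

  # Each field is either a single-quoted string ('' is an escaped quote)
  # or a bare token such as a number or NULL.
  m[1].scan(/'((?:[^']|'')*)'|([^,']+)/).map do |quoted, bare|
    quoted ? quoted.gsub("''", "'") : bare.strip
  end
end

# Made-up sample line in the shape .dump produces.
sample = %q{INSERT INTO "entries" VALUES(1,'图书馆','圖書館','tu2 shu1 guan3','library');}
puts insert_values(sample).join("\t")
```

Values that span several lines would need a real SQL parser; for a dictionary dump with one entry per row this is usually enough.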

I assume, however, that you're looking for a CEDICT-like format. For that you could write a quick Perl or Ruby (or your favorite language) script to extract it.

Here's a quick ruby one:

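#!/usr/bin/env ruby
# Export the per-character Adso tables (_1, _2, ...) to a CEDICT-style text file.
require 'sqlite3'

BAD_TABLES = [485, 486, 488] # corrupted tables, skipped

# Return the string if it is valid UTF-8, otherwise nil.
def clean_utf8(s)
  s = s.to_s.dup.force_encoding("utf-8")
  s.valid_encoding? ? s : nil
end

db = SQLite3::Database.new("adso.db")
File.open("adso.cedict.style", "wb") do |f|
  names = db.execute("select name from sqlite_master").map(&:first)
  names.select! { |n| n =~ /^_([^_]+)$/ && !BAD_TABLES.include?($1.to_i) }
  names.each do |name|
    db.execute("select * from #{name}").each do |r|
      # Columns 2, 3, 10, 11 hold English, pinyin, simplified, traditional.
      r = r.values_at(2, 3, 10, 11).map { |s| clean_utf8(s) }
      next if r.any? { |s| s.nil? || s.empty? }
      eng, pinyin, simp, trad = r
      # Simplified first, as originally posted (CC-CEDICT lists traditional first).
      f.puts "#{simp} #{trad} [#{pinyin}] /#{eng}/"
    end
  end
end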

Few things to note:

* Tables _485, _486, and _488 are corrupted

* Some characters aren't truly utf-8, so there's a line in there that throws out anything that isn't true utf-8

* For some reason some entries have empty English and pinyin. Not quite sure why.

* This script generates about 180,000 lines. There are probably dupes, but a quick glance looks pretty good.
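If the duplicates turn out to matter, exact repeats can be dropped in a second pass. A minimal sketch, assuming duplicates are byte-for-byte identical lines; the helper name dedupe_lines and the sample lines are invented for illustration.

```ruby
# Hypothetical helper: drop exact duplicate lines, keeping the first
# occurrence of each and preserving the original order.
def dedupe_lines(lines)
  lines.map(&:chomp).reject(&:empty?).uniq
end

# Made-up sample lines in the format the script writes.
lines = [
  "图书馆 圖書館 [tu2 shu1 guan3] /library/\n",
  "干净 乾淨 [gan1 jing4] /clean/\n",
  "图书馆 圖書館 [tu2 shu1 guan3] /library/\n",
]
unique = dedupe_lines(lines)
puts "#{lines.size - unique.size} duplicate(s) removed"
```

For the real file, the input would be something like File.readlines("adso.cedict.style", encoding: "utf-8"). Entries that differ only in the wording of the English gloss are not caught by this.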

July 6, 2007 at 04:39 PM

My last post was written in reply to your first post on this thread. It took me a while to smooth out the kinks with corrupted tables and UTF-8 encoding.

July 6, 2007 at 04:42 PM

Let me also say why I'm using NJStar (as you may be wondering).....

I like the dictionary feature where I can hover over a word and get the definition and pinyin.

But also, I can get it to write it in the processor so it saves me some time when I'm trying to define words.

Like this:

图书馆【túshūguǎn】 library.

I can then copy this directly into Word where I've created a macro to convert all the Chinese fonts into something nicer looking and also enlarge the font as well.

I ran into a problem when I couldn't locate words that I was able to find in PlecoDict, and that sent me on my search.

Thanks,

July 7, 2007 at 09:08 AM

Here's a zipped CSV file that I extracted from the latest development database. Format is

simplified,traditional,pinyin,english

http://e.den.li/adso-csv.zip (2.3M)

Hope this helps.

PS. Here's the same data but in a single sqlite table:

http://e.den.li/adso.single.db.gz (6.8M)

Schema:

CREATE TABLE entries (
     id INTEGER PRIMARY KEY AUTOINCREMENT,
     simp TEXT UNIQUE,
     trad TEXT UNIQUE,
     pinyin TEXT UNIQUE,
     english TEXT
   );

July 7, 2007 at 01:39 PM

Thanks Wulong,

I have two problems.

1. Excel cannot open the entire file, so I imported it into Access. The delimiters ( | ) don't seem to be consistent, so in some rows the pinyin ends up together with the characters, while other rows are fine.

2. The other file, single table with all data, appears not to be a valid archive.

July 7, 2007 at 05:29 PM

My laptop got tanked by a QQ install last week, which has stopped Adso-related work until I can get it fixed. I'll take a look at those corrupted tables when I'm back up and running.

I don't see why you can't dump in CEDICT format if you want. Part of the point of the database release is that it should be relatively simple to reformat data. The easiest way to access most of the data is to look at the table ("expanded_unified"). The SQL command "SELECT * from expanded_unified" will get you most of what you need.

The hard way of doing things is to look up the first character in the table character_index ("GB2312") or index_utf8s (simplified). The pkey in those tables corresponds to the table number containing all entries beginning with that character. If a character is listed in the index with a pkey of 84, for instance, all words starting with that character will be found in table _84.
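The index lookup described above boils down to "first character, then pkey, then table name". A rough Ruby sketch of that mapping follows; the in-memory hash stands in for rows read from index_utf8s or character_index, and both the pkey values and the helper name table_for are invented for illustration.

```ruby
# Hypothetical in-memory copy of the index: first character => pkey.
# In the real database these pairs would be read from the index table;
# the numbers below are made up.
INDEX = { "图" => 84, "干" => 1203 }

# Name of the table holding every word that starts with the same character
# as `word`, or nil when that character is not in the index.
def table_for(word, index = INDEX)
  pkey = index[word[0]]
  pkey && "_#{pkey}"
end

puts table_for("图书馆")
```

The returned name can then be used in a query such as "select * from _84".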

July 8, 2007 at 03:11 AM

Hi Trevelyan,

Thanks for the reply. It seems expanded_unified contains Chinese and Pinyin but no English.

This is the query:

Questions:

1. How do I get the English translation?

Thanks,

Bill

July 8, 2007 at 03:49 AM

> 1. Excel cannot open the entire file, so I imported it into Access. The delimiters ( | ) don't seem to be consistent, so in some rows the pinyin ends up together with the characters, while other rows are fine.

That's a dump directly from sqlite. You might have to fix up a few rows to get it to work.

> 2. The other file, single table with all data, appears not to be a valid archive.

It's a gzip file. You need to use WinZip or WinRAR if you're in Windows. If you're on Mac OS X, it should be built in.

> I don't see why you can't dump in CEDICT format if you want. Part of the point of the database release is that it should be relatively simple to reformat data. The easiest way to access most of the data is to look at the table ("expanded_unified"). The SQL command "SELECT * from expanded_unified" will get you most of what you need.

I don't even see expanded_unified. Can you point me to the archive that has the database with this table?

What I need is a simple list (simplified, traditional, pinyin, english) similar to what cedict gives. The database I have doesn't make it easy to do this which is why I had to resort to using ruby in order to pull everything together.

July 8, 2007 at 04:02 AM

wulong,

Yes, I have Winzip but it fails to open the archive saying it's corrupt or other error.

I'm using SQLite Maestro to view the tables (see the graphic in my previous post). It seems like a nice application, but I still cannot export to a file without memory errors or other issues.

http://www.sqlmaestro.com/products/sqlite/maestro/

July 8, 2007 at 04:17 AM

@woliveri

Hmm... haven't used Windows in awhile, but I remember running into issues with winzip and plain gzip files. Here's a zip file for you: http://e.den.li/adso.single.zip

Hopefully this one works better.

If you've installed sqlite3, you can get a dump file from the command line:

C:\path>sqlite3 adso.single.db
sqlite> .mode csv
sqlite> .output adso.csv
sqlite> select * from entries;
sqlite> .quit

There will be a new file called adso.csv in the same directory you started sqlite.
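A delimited export only stays aligned if fields that contain the delimiter are quoted, and English definitions often contain commas. When exporting from a script, Ruby's standard csv library takes care of the quoting. A minimal sketch with made-up sample rows standing in for the result of "select simp, trad, pinyin, english from entries":

```ruby
require 'csv'

# Made-up sample rows in simplified, traditional, pinyin, english order.
rows = [
  ["图书馆", "圖書館", "tu2 shu1 guan3", "library"],
  ["干净", "乾淨", "gan1 jing4", "clean, tidy, neat"],
]

# CSV.generate_line quotes any field that contains the separator.
csv_text = rows.map { |row| CSV.generate_line(row) }.join
puts csv_text

# Reading the text back yields the original four columns per row.
round_trip = CSV.parse(csv_text)
```

Spreadsheet and database import tools generally understand this quoting, so the columns should no longer drift.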

September 4, 2007 at 02:20 AM

Wulong, the schema for the sqlite database seems to be incorrect:

CREATE TABLE entries (
     id INTEGER PRIMARY KEY AUTOINCREMENT,
     simp TEXT UNIQUE,
     trad TEXT UNIQUE,
     pinyin TEXT UNIQUE,
     english TEXT
   );

The UNIQUE tags should not be there.

In the current table, there is only one entry with the pronunciation a1, which can't be correct. The csv file has the same problem.
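The effect of those constraints can be simulated without a database. The sketch below assumes that whatever loaded the table silently skipped rows that violated a UNIQUE constraint (that is a guess, the loader is not shown in this thread); the helper name insert_with_unique is invented, and the four sample rows were picked because they share a pinyin or a simplified form.

```ruby
# Sample rows (simplified, traditional, pinyin, english).
ROWS = [
  ["啊", "啊", "a1", "ah (interjection)"],
  ["阿", "阿", "a1", "prefix used before names"],
  ["行", "行", "xing2", "to walk; OK"],
  ["行", "行", "hang2", "row; profession"],
]

# Simulate inserting rows into a table where the given column indexes are
# UNIQUE and conflicting rows are skipped: a row survives only if none of
# its constrained values has been seen before.
def insert_with_unique(rows, unique_cols)
  seen = Hash.new { |h, col| h[col] = {} }
  rows.select do |row|
    next false if unique_cols.any? { |col| seen[col].key?(row[col]) }
    unique_cols.each { |col| seen[col][row[col]] = true }
    true
  end
end

puts insert_with_unique(ROWS, [0, 1, 2]).size # UNIQUE on simp, trad, pinyin
puts insert_with_unique(ROWS, []).size        # no UNIQUE constraints
```

With the constraints in place only the first a1 entry and the first 行 entry survive, which matches the symptom described above; without them all four rows are kept.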

I've tried several times to access the download site http://www.adsotate.com/downloads/ but I've never been successful. Is there any other way of accessing the raw data file?

September 4, 2007 at 01:36 PM

There will be a new release in a matter of days: database plus software plus several months of updates. In the meantime, an older version is still online at:

http://www.adsotrans.com/downloads/

The adsotate.com server had technical problems and is offline.

December 25, 2007 at 01:22 AM

First of all, I would like to thank Trevelyan for putting together such a useful dictionary, which helps me with reading online text as well as with my work translations.

Is it just me, or is the current database down? I need to create a PHP script that connects to the database and reformats the data to the format I need. I've downloaded sqlite3 and SQLite Maestro and need some tips on how to get them to dump in the needed format, since I can't seem to access this script: http://adsotrans.com/downloads/v5/php_script.txt. Can somebody please give me some pointers?

Here's what I'm trying to do. Ideally, I would like to import the full Traditional and Simplified database into Kingsoft Powerword 2007 which accepts .txt files in ANSI format that looks like the example below.

乾淨|[gan1 jing4]\r\nclean\r\ntidy\r\nneat\r\n <--format

Ends up looking like this...

乾淨 <-- lookup word

[gan1 jing4] <-- pinyin & definitions

clean

tidy

neat
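Producing lines in that shape from CEDICT-style data is mostly string assembly. A minimal Ruby sketch, assuming the separators in the sample line are the literal characters \r\n and that the English field arrives slash-delimited as in CEDICT; the helper name powerword_line is made up for illustration.

```ruby
# Literal backslash-r-backslash-n characters (single quotes keep the
# backslashes), assumed to be what the sample line uses between parts.
BREAK = '\r\n'

# Hypothetical helper: build one import line from a headword, its pinyin,
# and a CEDICT-style English field such as "/clean/tidy/neat/".
def powerword_line(headword, pinyin, english)
  definitions = english.split("/").reject(&:empty?)
  "#{headword}|[#{pinyin}]#{BREAK}" + definitions.map { |d| d + BREAK }.join
end

puts powerword_line("乾淨", "gan1 jing4", "/clean/tidy/neat/")
```

Converting the finished file from UTF-8 to the ANSI code page that the target program expects is a separate step and is left out here.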


Posters, in the order of the posts above: wulong, darkprince, trevelyan, wulong, woliveri, trevelyan, woliveri, wulong, wulong, woliveri, wulong, woliveri, trevelyan, woliveri, wulong, woliveri, wulong, perjp, trevelyan, ABCinChina.