Chinese-Forums

Downloadable dictionary file?


wulong


Are there any plans to make the dictionary underneath adsotrans widely available in a cedict or similar format? Stuff like that would help projects like Chinese Pera-kun or other client-side, offline solutions which keep a full copy of the dictionary on the client to reduce lookup overhead.


wulong, or anybody,

Do you know how to get only the data out of the db file, in a particular column order?

I'm not familiar with sqlite. The only command I could find to extract the info is .dump, and that dumps an SQL script file...

If I try to drop it in sql server or access they both freak as some of the characters are not understood.

If I could get a delimited file of only data, that would be nice.

Do you know the syntax for that?

thanks,


@woliveri,

There are lots of ways to get data out of a database, and a couple of versions of the database available. Shouldn't be hard to get customized data output. The most important table is probably "expanded_unified". What exactly do you need to do?


Hey trevelyan,

I became frustrated earlier today by the limitations of NJStar's database so it sent me on a search to add more entries into this application.

It seems the Adsotate db is one of the largest out there (that I could find), so I was trying to extract the data into a format that NJStar could easily import.

So I found this topic:

How to add public domain CEDICT project as additional dictionary into NJStar?

Here:

http://www.njstar.com/support/NJStar_Word_Processors/Chinese_Word_Processor/

They have the following instructions:

1. Download NJStar's CEDICT dictionary tool "Makdict.zip", then uncompress the files into new folder;

2. Go to the following web page and select to download "CEDICT in GB (simplified Chinese)" http://www.mandarintools.com/cedict.html

3. Uncompress the file "Cedict.gb" into the above folder, and rename it to "cedic.dic";

4. Run both "e2cdic2.exe" and "c2edic2.exe" to generate dictionary index files - e2cdic.dic and c2edic.dic;

5. Copy the 3 files (cedic.dic / e2cdic.dic / c2edic.dic) into NJStar Chinese WP folder;

6. Open NJStar Chinese WP and select "Tools", "Options", "Dictionaries" and set the additional dictionary as follows:

But the CEDICT db didn't do what I wanted, so I turned to the Adsotate db.

So, that's what I'm trying to do....

Is it possible?

Thanks in advance.


The data isn't really very normalized (in formal database terms), so it'll be hard for you to construct a query that will get you what you need.

You could take the dump file and run a parsing program to get close to what you need (simply remove the INSERT INTO ... crap).
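
A rough sketch of that dump-parsing idea in Ruby. It assumes every row in the .dump output sits on one line of the form INSERT INTO "table" VALUES(...); with single-quoted strings and doubled quotes inside them. The helper name parse_insert is made up for illustration, and values that contain newlines are not handled.

```ruby
# Hypothetical helper: turn one line of `.dump` output into [table, values].
# Assumes one INSERT per line, e.g.  INSERT INTO "_84" VALUES(1,'it''s',NULL);
def parse_insert(line)
  m = line.match(/\AINSERT INTO "?(\w+)"? VALUES\s*\((.*)\);\s*\z/)
  return nil unless m # not an INSERT line (CREATE TABLE, COMMIT, ...)

  # Each value is either a single-quoted string ('' is an escaped quote)
  # or a bare token such as a number or NULL.
  values = m[2].scan(/'((?:[^']|'')*)'|([^,\s][^,]*)/).map do |quoted, bare|
    if quoted
      quoted.gsub("''", "'")
    else
      bare.strip == "NULL" ? nil : bare.strip
    end
  end
  [m[1], values]
end

_table, values = parse_insert(%q{INSERT INTO "_84" VALUES(1,'it''s','ni3 hao3',NULL);})
puts values.compact.join("\t") # tab-delimited data only
```

Lines that are not INSERT statements come back as nil, so they can simply be skipped when looping over the dump file.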

I assume, however, that you're looking for a CEDICT-like format. For that you could write a quick perl or ruby (or your favorite language) script to extract it.

Here's a quick ruby one:

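#!/usr/bin/env ruby
# Export the per-character Adso tables (_1, _2, ...) to a CEDICT-style file.
require 'sqlite3'

SKIP = [485, 486, 488].freeze # tables that are corrupted in this release

db = SQLite3::Database.new("adso.db")
File.open("adso.cedict.style", "wb") do |f|
  tables = db.execute("select name from sqlite_master").
    map(&:first).
    select { |name| name =~ /^_([^_]+)$/ && !SKIP.include?($1.to_i) }
  tables.each do |table|
    db.execute("select * from #{table}").each do |row|
      # Columns 2, 3, 10, 11 hold english, pinyin, simplified, traditional.
      fields = row.values_at(2, 3, 10, 11).map do |s|
        s = s.to_s.dup.force_encoding("utf-8")
        s.valid_encoding? ? s : nil # throw out anything that isn't true utf-8
      end
      next if fields.any? { |s| s.nil? || s.empty? }
      eng, pinyin, simp, trad = fields
      f.puts "#{simp} #{trad} [#{pinyin}] /#{eng}/"
    end
  end
end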

Few things to note:

* Tables _485, _486, and _488 are corrupted

* Some characters aren't truly utf-8, so there's a line in there that throws out anything that isn't true utf-8

* For some reason some entries have empty English and pinyin. Not quite sure why.

* This script generates about 180,000 lines. There are probably dupes, but a quick glance looks pretty good.
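
For the utf-8 and dupes points, here is a minimal post-processing sketch. It assumes the output file has already been read into an array of lines; clean_lines is an illustrative name, not part of any Adso tool.

```ruby
# Keep only lines that are valid UTF-8 and drop exact duplicates.
def clean_lines(lines)
  lines
    .map { |l| l.dup.force_encoding(Encoding::UTF_8) } # reinterpret raw bytes
    .select(&:valid_encoding?)                          # discard broken rows
    .map(&:chomp)
    .uniq                                               # first occurrence wins
end

sample = ["图书馆 圖書館 [tu2 shu1 guan3] /library/\n",
          "图书馆 圖書館 [tu2 shu1 guan3] /library/\n"]
puts clean_lines(sample).size
```

This only removes lines that are byte-for-byte identical; entries that differ in spacing or gloss order would still need a closer look.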


Let me also say why I'm using NJStar (as you may be wondering).....

I like the dictionary feature where I can hover over a word and get the definition and pinyin.

But also, I can get it to write it in the processor so it saves me some time when I'm trying to define words.

Like this:

图书馆【túshūguǎn】 library.

I can then copy this directly into Word where I've created a macro to convert all the Chinese fonts into something nicer looking and also enlarge the font as well.

I came into a problem when I couldn't locate words that I was able to find in PlecoDict so that sent me on my search.

Thanks,


Here's a zipped CSV file that I extracted from the latest development database. Format is

simplified,traditional,pinyin,english

http://e.den.li/adso-csv.zip (2.3M)
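
If a CEDICT-style file is the goal, a record in that four-field layout could be converted roughly like this. The helper name and the sep parameter are assumptions; a raw sqlite dump uses "|" rather than "," unless the separator was changed.

```ruby
# Illustrative helper: convert one "simp,trad,pinyin,english" record into
# a "simp trad [pinyin] /english/" line.
# `sep` is an assumption; pass "|" for a raw sqlite dump.
def to_cedict_line(record, sep = ",")
  # Limit the split to 4 so separators inside the English gloss survive.
  simp, trad, pinyin, english = record.chomp.split(sep, 4)
  return nil if [simp, trad, pinyin, english].any? { |f| f.nil? || f.empty? }
  "#{simp} #{trad} [#{pinyin}] /#{english}/"
end

puts to_cedict_line("图书馆,圖書館,tu2 shu1 guan3,library, a place for books")
```

Records with a missing field come back as nil, so they can be filtered out with compact.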

Hope this helps.

PS. Here's the same data but in a single sqlite table:

http://e.den.li/adso.single.db.gz (6.8M)

Schema:

CREATE TABLE entries (
     id INTEGER PRIMARY KEY AUTOINCREMENT,
     simp TEXT UNIQUE,
     trad TEXT UNIQUE,
     pinyin TEXT UNIQUE,
     english TEXT
   );


Thanks Wulong,

I have two problems.

1. Excel cannot open the entire file, so I sucked it into Access, but the delimiters ( | ) don't seem to be consistent, so I have pinyin together with characters in some rows while others are OK.

2. The other file, single table with all data, appears not to be a valid archive. :cry:


My laptop got tanked by a QQ install last week, which has stopped Adso-related work until I can get it fixed. I'll take a look at those corrupted tables when I'm back up and running.

I don't see why you can't dump in CEDICT format if you want. Part of the point of the database release is that it should be relatively simple to reformat data. The easiest way to access most of the data is to look at the table ("expanded_unified"). The SQL command "SELECT * from expanded_unified" will get you most of what you need.

The hard way of doing things is to look up the first character in the table character_index ("GB2312") or index_utf8s (simplified). The pkey in those tables corresponds to the table number containing all entries beginning with that character. If a character is listed in the index with a pkey of 84, for instance, all words starting with that character will be found in table _84.
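
The lookup just described, as a small sketch. The index table is modelled here as a plain Hash of character to pkey with made-up sample values; in practice those pairs would come from the index table in the database.

```ruby
# Sketch of the index lookup: first character -> pkey -> table "_<pkey>".
# `index` stands in for the rows of the character index table.
def table_for(word, index)
  pkey = index[word[0]] # word[0] is the first character of the string
  pkey && "_#{pkey}"    # nil when the character is not in the index
end

sample_index = { "图" => 84, "干" => 12 } # hypothetical pkeys
puts table_for("图书馆", sample_index)
```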


1. Excel cannot open the entire file, so I sucked it into Access, but the delimiters ( | ) don't seem to be consistent, so I have pinyin together with characters in some rows while others are OK.

That's a dump directly from sqlite. You might have to fix up a few rows to get it to work.
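
Finding the rows that need fixing can be automated. This sketch splits the lines into those with the expected number of fields and those without; the "|" separator and the four-field layout (simplified, traditional, pinyin, english) are assumptions taken from this thread, and partition_rows is an illustrative name.

```ruby
# Separate rows with the expected number of fields from rows that need
# manual repair.
def partition_rows(lines, sep = "|", expected = 4)
  # split with -1 keeps trailing empty fields, so "a|b|c|" counts as 4
  lines.map(&:chomp).partition { |l| l.split(sep, -1).size == expected }
end

good, bad = partition_rows(["好|好|hao3|good", "好好|hao3|good"])
puts "#{good.size} ok, #{bad.size} to fix"
```

Rows whose English gloss itself contains the separator will also land in the second list, which is usually what you want before an import.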

2. The other file, single table with all data, appears not to be a valid archive.

It's a gzip file. You need to use WinZip or WinRAR if you're in Windows. If you're on Mac OS X, it should be built in.
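
If an archive tool balks at a plain gzip file, a few lines of Ruby can unpack it as well. This is a sketch using the zlib bindings that ship with Ruby; gunzip is just an illustrative name and the paths are examples.

```ruby
require 'zlib'

# Decompress a .gz file to `dest` and return the destination path.
def gunzip(src, dest)
  Zlib::GzipReader.open(src) do |gz|
    # Stream the decompressed bytes so the whole file is never held in memory.
    File.open(dest, "wb") { |out| IO.copy_stream(gz, out) }
  end
  dest
end

# Usage:
#   gunzip("adso.single.db.gz", "adso.single.db")
```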

I don't see why you can't dump in CEDICT format if you want. Part of the point of the database release is that it should be relatively simple to reformat data. The easiest way to access most of the data is to look at the table ("expanded_unified"). The SQL command "SELECT * from expanded_unified" will get you most of what you need.

I don't even see expanded_unified. Can you point me to the archive that has the database with this table?

What I need is a simple list (simplified, traditional, pinyin, english) similar to what cedict gives. The database I have doesn't make it easy to do this which is why I had to resort to using ruby in order to pull everything together.


wulong,

Yes, I have Winzip but it fails to open the archive saying it's corrupt or other error.

I'm using SQLite Maestro to view the tables (see the graphic in my previous post). It seems like a nice application, but I still cannot export to a file without memory errors or other issues.

http://www.sqlmaestro.com/products/sqlite/maestro/


@woliveri

Hmm... I haven't used Windows in a while, but I remember running into issues with WinZip and plain gzip files. Here's a zip file for you: http://e.den.li/adso.single.zip

Hopefully this one works better.

If you've installed sqlite3, you can get a dump file from the command line:

C:\path>sqlite3 adso.single.db
sqlite> .separator ,
sqlite> .output adso.csv
sqlite> select * from entries;
sqlite> .quit

There will be a new file called adso.csv in the directory where you started sqlite.


  • 1 month later...

Wulong, the schema for the sqlite database seems to be incorrect:

CREATE TABLE entries (
     id INTEGER PRIMARY KEY AUTOINCREMENT,
     simp TEXT UNIQUE,
     trad TEXT UNIQUE,
     pinyin TEXT UNIQUE,
     english TEXT
   );

The UNIQUE tags should not be there.

In the current table, there is only one entry with the pronunciation a1, which can't be correct. The csv file has the same problem.
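
A quick way to check an export for this symptom is to count how many records share each pinyin reading. The sketch below assumes "simp,trad,pinyin,english" records; pinyin_counts is an illustrative name and the separator is adjustable.

```ruby
# Count how many records share each pinyin reading. If every reading
# occurs exactly once, homophones were probably dropped on insert.
def pinyin_counts(lines, sep = ",")
  counts = Hash.new(0)
  lines.each do |l|
    pinyin = l.chomp.split(sep, 4)[2] # third field; nil for short records
    counts[pinyin] += 1 if pinyin
  end
  counts
end

counts = pinyin_counts(["啊,啊,a1,ah", "阿,阿,a1,prefix", "好,好,hao3,good"])
puts counts.values.max
```

In a healthy dictionary the maximum should be well above 1, since many characters share a reading.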

I've tried several times to access the download site http://www.adsotate.com/downloads/ but I've never been successful. Is there any other way of accessing the raw data file?


  • 3 months later...

First of all, I would like to thank Trevelyan for putting together such a useful dictionary, which helps me in reading online text as well as in my translation work.

Is it just me, or is the current database down? I need to create a PHP script that connects to the database and reformats the data into the format I need. I've downloaded sqlite3 and SQLite Maestro and need some tips on how to get them to dump in the needed format, since I can't seem to access this script: http://adsotrans.com/downloads/v5/php_script.txt Can somebody please give me some pointers?

Here's what I'm trying to do. Ideally, I would like to import the full Traditional and Simplified database into Kingsoft Powerword 2007, which accepts .txt files in ANSI format that look like the example below.

乾淨|[gan1 jing4]\r\nclean\r\ntidy\r\nneat\r\n <--format

Ends up looking like this...

乾淨 <-- lookup word

[gan1 jing4] <-- pinyin & definitions

clean

tidy

neat
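
One way such a record could be assembled, as a hedged sketch. It assumes the "rn" in the sample line stands for a literal \r\n sequence written into the text file, and that the English glosses are separated by "/" as in CEDICT. The name powerword_entry is illustrative, and conversion to the ANSI code page is left out.

```ruby
# Build one "word|[pinyin]\r\ndef\r\ndef\r\n" record. The \r\n here is the
# literal four-character sequence, not an actual line break.
def powerword_entry(word, pinyin, english)
  defs = english.split("/").map(&:strip).reject(&:empty?)
  "#{word}|[#{pinyin}]\\r\\n" + defs.map { |d| "#{d}\\r\\n" }.join
end

puts powerword_entry("乾淨", "gan1 jing4", "clean/tidy/neat")
```

If Powerword expects real line breaks instead, replacing "\\r\\n" with "\r\n" in the helper is the only change needed.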
