Chinese-Forums

Request: Handling large volumes of vocabulary


Olle Linge


Hello,

ZDT has been my trustworthy companion ever since I started learning Chinese, and in general I have nothing to complain about. However, I have thought of some features that would make the program much better. These problems didn't show up until recently, because one needs a lot of characters before they become significant. I have over 5,000 entries now, and here is one suggestion that would make them much, much easier to handle:

- Allow subdirectories

I prefer to keep my words separated into the various chapters of the relevant books I found them in, but this is very annoying when you have hundreds of categories (it takes a while to scroll).

Furthermore, subdirectories would also allow me to choose more quickly which words to test for the flashcards. Right now, I have to select a lot of folders, which is a bit annoying. Having the option of sorting my characters more efficiently would be great.

Here is another suggestion:

- Enable some search for duplicate entries

Right now I have to export the database, open it in an external program, search for the character, go back to ZDT and remove the overlapping entries. Not very convenient. Sometimes, it's also hard to spot duplicates.
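Until something like this is built in, a rough check is possible from the command line on the exported file. This is only a sketch: it assumes a tab-separated export with the headword in the first column, which may not match ZDT's actual export format (the toy `export.txt` below stands in for a real export).

```shell
# Toy export file (tab-separated: character, pinyin, definition) -- the real
# ZDT export format may differ; adjust the column number accordingly.
printf '你好\tni3 hao3\thello\n谢谢\txie4 xie4\tthanks\n你好\tni3 hao3\thi\n' > export.txt

# Print every headword that occurs more than once.
cut -f1 export.txt | sort | uniq -d
```

This only lists the duplicated headwords; you would still remove the extra entries by hand inside ZDT.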

These are just suggestions for making an already outstanding program even better. I will continue using ZDT, but especially the subdirectories would make me very, very happy.

Regards,

Snigel


I agree, especially on the duplicate search. Maybe something like "compare this category with that one", with the option of deleting the duplicates from the second category.

I would use this to compare new vocabulary lists with my HSK list, so I can practice only the ones I haven't studied before.

Another suggestion: merging and separating categories, especially merging. At first I also added lists chapter by chapter, but as I get more familiar with the words, I'd like to merge the chapters into a single list.


For duplicates:

I would prefer a manual check here, and I assume you would too. All that is needed is a function that can find duplicates, preferably matching on characters rather than pinyin. Of course, fancier features would be good too, but they're not essential.

For merging lists:

This is fairly easy to do:

1) Create a backup file

2) Edit the backup file with a text editor

3) Remove the category headings

4) Restore data from your modified file

5) Good to go
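Steps 1 and 3 can be done from the shell. A sketch, assuming the backup file uses the same "INSERT INTO CATEGORY VALUES" lines as user.script does; inspect your own backup before trusting this (the toy `backup.script` below stands in for a real ZDT backup):

```shell
# Toy backup file standing in for a real ZDT backup; the line format here
# is an assumption based on what user.script looks like.
cat > backup.script <<'EOF'
INSERT INTO CATEGORY VALUES(0,'Chapter 1')
INSERT INTO USER_ENTRY VALUES(0,'word1')
INSERT INTO CATEGORY VALUES(1,'Chapter 2')
INSERT INTO USER_ENTRY VALUES(1,'word2')
EOF

# Step 1: keep an untouched copy of the backup.
cp backup.script backup.script.orig

# Step 3: drop the category headings, keeping all entries.
grep -v 'INSERT INTO CATEGORY VALUES' backup.script.orig > backup.script
cat backup.script
```

After this, restoring from the edited file would be step 4, done inside ZDT as usual.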


3 weeks later...
I prefer to keep my words separated into the various chapters of the relevant books I found them in, but this is very annoying when you have hundreds of categories (it takes a while to scroll).

Furthermore, subdirectories would also allow me to choose more quickly which words to test for the flashcards. Right now, I have to select a lot of folders, which is a bit annoying. Having the option of sorting my characters more efficiently would be great.

It's not a pretty solution, but keep in mind that, at least in the Linux version, the categories are kept in a file called "user.script". When you start ZDT, it reads whichever user.script file is in the current directory. So you can create different user.script files in different directories, each with a different set of categories. And you can use the backup / restore feature to move categories between the different files.
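Sketched as shell commands; the directory names and the two source files here are made up for illustration (real user.script files would come from ZDT itself):

```shell
# Stand-in files for two different category sets.
printf 'textbook categories\n' > user.script.textbook
printf 'HSK categories\n'      > user.script.hsk

# One directory per category set; zdt loads whatever user.script it finds
# in the directory it is started from.
mkdir -p textbook hsk
cp user.script.textbook textbook/user.script
cp user.script.hsk      hsk/user.script

# (cd textbook && zdt)   # would start zdt with the textbook category set
```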


- Enable some search for duplicate entries

Right now I have to export the database, open it in an external program, search for the character, go back to ZDT and remove the overlapping entries. Not very convenient. Sometimes, it's also hard to spot duplicates.

Here's a simple gawk script that does that in a very dumb way. Basically, it goes through all your categories and removes every entry that has been seen before. [It matches on ALL of simplified, traditional, and pinyin.] Before you use it, RENAME YOUR USER.SCRIPT FILE FIRST. I tested it for about 2 minutes and it didn't crash anything, e.g.:

mv user.script user.script.good

gawk -f parse.awk user.script.good > user.script

[This assumes you are under Linux and have saved the script as parse.awk.]

Currently, it keeps the first occurrence. Note that the categories are saved in the order you created them, not in the order you see them inside ZDT. To see the actual order, look for the "INSERT INTO CATEGORY VALUES" lines in your user.script file.
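For example, to list the categories in creation order (the toy user.script below is a stand-in; real files have more fields per line):

```shell
# Toy user.script fragment standing in for the real file.
cat > user.script <<'EOF'
INSERT INTO CATEGORY VALUES(0,'Chapter 1')
INSERT INTO USER_ENTRY VALUES(0,'word')
INSERT INTO CATEGORY VALUES(1,'HSK list')
EOF

# The CATEGORY lines appear in the order the categories were created.
grep 'INSERT INTO CATEGORY VALUES' user.script
```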

The ideas for improvement are endless. But if you actually use this and want a new feature, let me know.

BEGIN {
  foundCnt = 0;
}


/INSERT INTO USER_ENTRY VALUES/ {
  # Parsing the fields is a bit complex.  Fields are separated by commas that
  # are NOT within single quotation marks, so just scan one comma at a time.
  # Note that only the last field can itself contain commas.

  # Find the first "("; this starts the data.
  n = index($0, "(");
  data = substr($0, n+1);
  #print data;

  # Now find each field.
  for ( i=1 ; i<=7 ; ++i ) {
    # See if the next character is a '
    firstQ = index(data, "'");
    if ( firstQ == 1 ) {
      # There is a ', so the field runs from the first ' to the second '.
      secondQ = index(substr(data, firstQ+1), "'");
      #print substr(data, 1, secondQ+1);
      field[i] = substr(data, 1, secondQ+1);
      # Skip past the closing quote and the comma that follows it.
      data = substr(data, secondQ+3);
    }
    else {
      # There is no ', so the field runs up to the next comma.
      firstC = index(data, ",");
      #print substr(data, 1, firstC-1);
      field[i] = substr(data, 1, firstC-1);
      data = substr(data, firstC+1);
    }
  }

  # Check if we've seen this word (simplified, traditional, pinyin) before.
  if ( found[field[4], field[5], field[6]] == "" ) {
     # If not, print it out and mark it as seen.  While that seems simple,
     # it's not: we can't print the line as-is, because the index has to be
     # renumbered as entries are removed.  In addition, we need to remember
     # which indices we removed, so the corresponding stats can be removed
     # later as well.
     found[field[4], field[5], field[6]] = field[1];
     sub(field[1], foundCnt, $0);
     foundConvert[field[1]] = foundCnt;
     print $0;
     ++foundCnt;
  }
  else {
     # Just mark it for removal.
     foundConvert[field[1]] = -1;
  }
  next;
}


/INSERT INTO STAT VALUES/ {
  # Print these out *almost* as-is; the only change is to the index.  For
  # now, assume the index is between the first "(" and the first ","; this
  # seems a bit brittle, but it appears to be correct for now....
  findP = index($0, "(");
  findC = index($0, ",");
  idx = substr($0, findP+1, findC-findP-1);
  if ( foundConvert[idx] == -1 ) {
     # The entry was a duplicate, so drop its stats too.
  }
  else {
     # Use the new index, and print it out.
     sub(idx, foundConvert[idx], $0);
     print $0;
  }

  next;
}


# Not a word or stat line, so just print it out.
{
  print $0;
}

