
grep with CC-CEDICT? (UTF8)


mungouk


Bit of a techie question here. I'm pretty rusty with command-line grep as I've not used it in a few years now.

 

I'm doing a bit of file processing that involves finding words in the CC-CEDICT dictionary file cedict_ts.u8, which I downloaded from MDBG.

 

I can't get grep (or egrep) to find spaces in the input, whereas a file I've created myself works fine. 

 

Has anyone come across this before? Is it something to do with spaces after Hanzi in Unicode not matching the "\s" pattern?

 

I created a small test file test.u8, and I can find hanzi followed by spaces in that no problem.


See terminal output below.  I've added a space before each shell prompt for readability.  I'm using bash on OSX 10.11.6.

 

$ cat test.u8 
Weekend News
Weekend
周末快乐 other stuff
周末 other stuff

 

$ file test.u8
test.u8: UTF-8 Unicode text

 

$ grep "^周末\s" test.u8
周末 other stuff

 

$ grep "周末" test.u8
周末快乐 other stuff
周末 other stuff

 

$ file cedict_ts.u8 
cedict_ts.u8: UTF-8 Unicode English text, with very long lines, with CRLF line terminators

 

$ head -10 cedict_ts.u8 
# CC-CEDICT
# Community maintained free Chinese-English dictionary.

# Published by MDBG

# License:
# Creative Commons Attribution-Share Alike 3.0
# http://creativecommons.org/licenses/by-sa/3.0/

# Referenced works:

 

$ grep "周末" cedict_ts.u8 
南方周末 南方周末 [Nan2 fang1 Zhou1 mo4] /Southern Weekend (newspaper)/
週末 周末 [zhou1 mo4] /weekend/
週末愉快 周末愉快 [zhou1 mo4 yu2 kuai4] /Have a nice weekend!/

 

$ grep "^周末\s" cedict_ts.u8 
$
  (no output)

 

 

Actually this also produces no output, so maybe it's not just a space issue?

 

$ grep "^周末" cedict_ts.u8 


It's not a space issue.  The format of CC-CEDICT is

 

Trad Simp [pinyin] /definition/

 

Your regex starts with ^, so it's searching for 周末 at the start of the line.  CC-CEDICT always has the traditional form at the start of the line, so guess what the traditional version of 周末 is? (Hint: it's not 周末.)
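
If the goal is to look words up by their simplified form, one option is to anchor the pattern on the second field rather than the first. A minimal sketch, assuming the fields are separated by single spaces as in the format above; the file name cedict_sample.u8 is made up for illustration and the three entries are the ones from the grep output earlier in the thread:

```shell
# Tiny sample in CC-CEDICT layout (hypothetical file name).
printf '%s\n' \
  '南方周末 南方周末 [Nan2 fang1 Zhou1 mo4] /Southern Weekend (newspaper)/' \
  '週末 周末 [zhou1 mo4] /weekend/' \
  '週末愉快 周末愉快 [zhou1 mo4 yu2 kuai4] /Have a nice weekend!/' > cedict_sample.u8

# ^[^ ]+ skips the traditional field, then the simplified field
# must be exactly 周末 (a space on both sides).
result=$(grep -E '^[^ ]+ 周末 ' cedict_sample.u8)
echo "$result"
```

Only the 週末 周末 entry should come back, because the other two lines have a longer word in the second field.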


GAH!!  Thanks @imron, actually I was kind of hoping it was a #PEBKAC :)

 

I had been working in the terminal at rather low resolution and hadn't noticed.  Also StickyStudy format (which I'm aiming for) has simplified first, traditional second, so somehow that was in my head. 

 

So this works as expected:

 

$ grep "^.*\s周末\s" cedict_ts.u8 

週末 周末 [zhou1 mo4] /weekend/
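
For the simplified-first ordering mentioned above, the first two fields can be swapped with awk. This is only a rough sketch: it assumes nothing about StickyStudy beyond "simplified first, traditional second", so the output layout is illustrative rather than the real import format. It also strips the carriage return that `file` reported (CRLF line terminators), and the sample file name is hypothetical:

```shell
# Small CRLF sample standing in for cedict_ts.u8 (hypothetical file name).
printf '%s\r\n' \
  '# CC-CEDICT' \
  '週末 周末 [zhou1 mo4] /weekend/' > cedict_crlf_sample.u8

swapped=$(awk '
  /^#/ { next }              # skip header comment lines
  {
    sub(/\r$/, "")           # drop the trailing CR left by CRLF line endings
    trad = $1; simp = $2
    # everything after the second space: pinyin and definitions, untouched
    rest = substr($0, length(trad) + length(simp) + 3)
    print simp, trad, rest
  }' cedict_crlf_sample.u8)
echo "$swapped"
```

Using substr for the remainder, rather than reassigning $1 and $2, leaves the spacing inside the definitions exactly as it was in the source file.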

 

And now, back to your scheduled discussion of scholarships, workplace relationships and other ships...

 

Cheers!

 

 
