Chinese-forums.com

grep with CC-CEDICT? (UTF8)


mungouk

Bit of a techie question here. I'm pretty rusty with command-line grep as I've not used it in a few years now.

 

I'm doing a bit of file processing that involves finding words in the CC-CEDICT dictionary cedict_ts.u8 which I downloaded from MDBG.

 

I can't get grep (or egrep) to find spaces in the input, whereas a file I've created myself works fine. 

 

Has anyone come across this before?  Something to do with spaces after Hanzi in Unicode that means they don't match the "\s" pattern?

 

I created a small test file test.u8, and I can find hanzi followed by spaces in that no problem.


See terminal output below.  I've added a space before each shell prompt for readability.  I'm using bash on OSX 10.11.6.

 

$ cat test.u8 
Weekend News
Weekend
周末快乐 other stuff
周末 other stuff

 

$ file test.u8
test.u8: UTF-8 Unicode text

 

$ grep "^周末\s" test.u8
周末 other stuff

 

$ grep "周末" test.u8
周末快乐 other stuff
周末 other stuff

 

$ file cedict_ts.u8 
cedict_ts.u8: UTF-8 Unicode English text, with very long lines, with CRLF line terminators

 

$ head -10 cedict_ts.u8 
# CC-CEDICT
# Community maintained free Chinese-English dictionary.

# Published by MDBG

# License:
# Creative Commons Attribution-Share Alike 3.0
# http://creativecommons.org/licenses/by-sa/3.0/

# Referenced works:

 

$ grep "周末" cedict_ts.u8 
南方周末 南方周末 [Nan2 fang1 Zhou1 mo4] /Southern Weekend (newspaper)/
週末 周末 [zhou1 mo4] /weekend/
週末愉快 周末愉快 [zhou1 mo4 yu2 kuai4] /Have a nice weekend!/

 

$ grep "^周末\s" cedict_ts.u8 
$
  (no output)

 

 

Actually this also produces no output, so maybe it's not just a space issue?

 

$ grep "^周末" cedict_ts.u8 

imron

It's not a space issue.  The format of CC-CEDICT is

 

Traditional Simplified [pinyin] /definition/

 

Your regex starts with ^ so it's searching for 周末 at the start of the line.  CC-CEDICT always has the traditional form at the start of the line, so guess what the traditional version of 周末 is? (hint, it's not 周末).
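
To make the column layout concrete, here is a minimal sketch that compares the second whitespace-separated field (the simplified form) instead of anchoring at the start of the line. The sample file and the helper name lookup_simplified are made up for illustration; the three entries are copied from the grep output earlier in the thread.

```shell
#!/bin/sh
# Hypothetical sample file holding three entries quoted earlier in the thread.
sample="${TMPDIR:-/tmp}/cedict_sample.u8"
cat > "$sample" <<'EOF'
南方周末 南方周末 [Nan2 fang1 Zhou1 mo4] /Southern Weekend (newspaper)/
週末 周末 [zhou1 mo4] /weekend/
週末愉快 周末愉快 [zhou1 mo4 yu2 kuai4] /Have a nice weekend!/
EOF

# lookup_simplified WORD FILE (illustrative helper, not part of any tool):
# print the lines whose second field, the simplified form, equals WORD exactly.
lookup_simplified() {
    awk -v word="$1" '$2 == word' "$2"
}

# Only the entry whose simplified form is exactly 周末 is printed.
lookup_simplified "周末" "$sample"
```

Because awk splits on whitespace and compares the whole field, longer words such as 周末愉快 are not picked up, and no regex escaping is needed.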

mungouk

GAH!!  Thanks @imron, actually I was kind of hoping it was a #PEBKAC :)

 

I had been working in the terminal at rather low resolution and hadn't noticed.  Also StickyStudy format (which I'm aiming for) has simplified first, traditional second, so somehow that was in my head. 

 

So this works as expected:

 

$ grep "^.*\s周末\s" cedict_ts.u8 

週末 周末 [zhou1 mo4] /weekend/

 

And now, back to your scheduled discussion of scholarships, workplace relationships and other ships...

 

Cheers!

 

 

imron
31 minutes ago, mungouk said:

So this works as expected:

 

$ grep "^.*\s周末\s" cedict_ts.u8 

My preference would be

 

grep "^\S\+\s周末\s" cedict_ts.u8

 

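This should be slightly faster, and it is also stricter: \S\+ can only span the first field, so 周末 has to be the second field rather than appear anywhere later in the line.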
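
The \s and \S shorthands are extensions rather than POSIX syntax, so here is a hedged sketch of the same pattern written with grep -E and POSIX bracket classes. It assumes the entry format described above; the sample file and the helper name find_simplified are invented for the example.

```shell
#!/bin/sh
# Hypothetical sample file: one header comment plus entries quoted in the thread.
sample="${TMPDIR:-/tmp}/cedict_sample_posix.u8"
cat > "$sample" <<'EOF'
# CC-CEDICT
南方周末 南方周末 [Nan2 fang1 Zhou1 mo4] /Southern Weekend (newspaper)/
週末 周末 [zhou1 mo4] /weekend/
週末愉快 周末愉快 [zhou1 mo4 yu2 kuai4] /Have a nice weekend!/
EOF

# find_simplified WORD FILE (illustrative helper):
# one run of non-space characters (the traditional form), one whitespace
# character, then WORD followed by whitespace.
# WORD is pasted into the regex as-is, so regex metacharacters in it are
# interpreted rather than matched literally.
find_simplified() {
    grep -E "^[^[:space:]]+[[:space:]]${1}[[:space:]]" "$2"
}

find_simplified "周末" "$sample"
```

On the sample file this selects the same single entry as the \S\+ version, just spelled with character classes that any POSIX grep should accept.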
