grep with CC-CEDICT? (UTF8)


Bit of a techie question here. I'm pretty rusty with command-line grep as I've not used it in a few years now.


I'm doing a bit of file processing that involves finding words in the CC-CEDICT dictionary cedict_ts.u8 which I downloaded from MDBG.


I can't get grep (or egrep) to match spaces in that file, whereas the same pattern works fine on a file I've created myself.


Has anyone come across this before?  Could the spaces after hanzi in this file be some Unicode character that doesn't match the "\s" pattern?


I created a small test file test.u8, and I can find hanzi followed by spaces in that no problem.

See terminal output below.  I'm using bash on OS X 10.11.6.


$ cat test.u8 
Weekend News
周末快乐 other stuff
周末 other stuff


$ file test.u8
test.u8: UTF-8 Unicode text


$ grep "^周末\s" test.u8
周末 other stuff


$ grep "周末" test.u8
周末快乐 other stuff
周末 other stuff


$ file cedict_ts.u8 
cedict_ts.u8: UTF-8 Unicode English text, with very long lines, with CRLF line terminators


$ head -10 cedict_ts.u8 
# Community maintained free Chinese-English dictionary.

# Published by MDBG

# License:
# Creative Commons Attribution-Share Alike 3.0

# Referenced works:


$ grep "周末" cedict_ts.u8 
南方周末 南方周末 [Nan2 fang1 Zhou1 mo4] /Southern Weekend (newspaper)/
週末 周末 [zhou1 mo4] /weekend/
週末愉快 周末愉快 [zhou1 mo4 yu2 kuai4] /Have a nice weekend!/


$ grep "^周末\s" cedict_ts.u8 
  (no output)



Actually this also produces no output, so maybe it's not just a space issue?


$ grep "^周末" cedict_ts.u8 



It's not a space issue.  The format of CC-CEDICT is


Trad  Simp [pinyin] /definition/


Your regex starts with ^ so it's searching for 周末 at the start of the line.  CC-CEDICT always has the traditional form at the start of the line, so guess what the traditional version of 周末 is? (hint, it's not 周末).
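
To see this on a small scale, here is a rough sketch that builds a three-line sample in the same format and counts the matches for each anchored pattern. The temporary sample file and the variable names are made up for illustration; the entries themselves are copied from the grep output earlier in the thread.

```shell
# Hypothetical sample file; the entries are copied from the grep output above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
南方周末 南方周末 [Nan2 fang1 Zhou1 mo4] /Southern Weekend (newspaper)/
週末 周末 [zhou1 mo4] /weekend/
週末愉快 周末愉快 [zhou1 mo4 yu2 kuai4] /Have a nice weekend!/
EOF

# Simplified form anchored at line start: no entry begins with it.
# grep -c exits non-zero when the count is 0, hence the "|| true".
simp_hits=$(grep -c "^周末" "$sample" || true)

# Traditional form anchored at line start: both 週末 entries begin with it.
trad_hits=$(grep -c "^週末" "$sample" || true)

echo "simplified at line start:  $simp_hits"
echo "traditional at line start: $trad_hits"
```

On this sample the first count is 0 and the second is 2, which is the same behaviour you're seeing against the full dictionary.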


GAH!!  Thanks @imron, actually I was kind of hoping it was a #PEBKAC :)


I had been working in the terminal at rather low resolution and hadn't noticed.  Also StickyStudy format (which I'm aiming for) has simplified first, traditional second, so somehow that was in my head. 


So this works as expected:


$ grep "^.*\s周末\s" cedict_ts.u8 
週末 周末 [zhou1 mo4] /weekend/


And now, back to your scheduled discussion of scholarships, workplace relationships and other ships...





31 minutes ago, mungouk said:

So this works as expected:


$ grep "^.*\s周末\s" cedict_ts.u8 

My preference would be


grep "^\S\+\s周末\s" cedict_ts.u8


Which should be slightly faster.
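
If you'd rather avoid the regex altogether, another option (just a sketch, not something from CC-CEDICT's own tooling) is to let awk split each line on whitespace and compare the second field, which is the simplified form. The helper name lookup_simplified and the temporary sample file are made up for illustration, and I haven't timed this against grep on the full file.

```shell
# Hypothetical sample file; the entries are copied from the grep output
# earlier in the thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
南方周末 南方周末 [Nan2 fang1 Zhou1 mo4] /Southern Weekend (newspaper)/
週末 周末 [zhou1 mo4] /weekend/
週末愉快 周末愉快 [zhou1 mo4 yu2 kuai4] /Have a nice weekend!/
EOF

# Hypothetical helper: print entries whose simplified form (field 2)
# is exactly the word given as the first argument.
# The second argument is the dictionary file to search.
lookup_simplified() {
  awk -v word="$1" '$2 == word' "$2"
}

result=$(lookup_simplified "周末" "$sample")
echo "$result"
```

Because the comparison is an exact match on the whole field, longer entries such as 周末愉快 are skipped without needing the trailing \s.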

