Chinese-Forums

Python script to retrieve articles from the CRI website


imron


If you've ever listened to the streaming news broadcasts of CRI, you've probably also tried piecing together the script for the broadcast by opening up all the links to the related articles, and copying/pasting the text into one complete document.

Doing this on a regular basis gets old pretty fast, and so I've written a small utility that will automatically download the articles for a number of CRI broadcasts, and put them all together in one big text file.

The utility is written in the Python programming language, so if you want to run it, you will need to first make sure you have downloaded and installed Python 2.5 (you need to have version 2.5 installed, as several of the scripts make use of the 'with' statement).
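For anyone curious what the 'with' statement looks like, here is a minimal sketch of the "put everything in one big text file" step. The name save_articles and the blank-line separator are assumptions for illustration, not code taken from crisucker.zip. It is written for current Python; on 2.5 itself, 'with' has to be switched on with from __future__ import with_statement, and the io module used here is not available.

```python
import io


def save_articles(path, articles):
    """Write all article texts to one UTF-8 file, separated by blank lines."""
    # The 'with' block closes the file even if the write raises an error.
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(u"\n\n".join(articles))
```

Calling save_articles("broadcast.txt", texts) with a list of unicode strings would produce a single file that can be opened in any editor that understands UTF-8.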

I wrote this as a way to learn a bit about Python programming and besides the obligatory 'Hello World', this is my first Python program. This means that functionality is still pretty basic, but it gets the job done. For those of you that know Python well, please don't get too offended by the source code :mrgreen:

Anyway, it's quite simple to use: just download and extract the zip, then run: python crisucker.py

This will display the screen shown in crisucker1.jpg

The window contains 2 listboxes. The one on the left gets populated with all the available news broadcasts found on the CRI webpage.

You can select and move broadcasts from the list on the left to the list on the right by using the add/add all/remove/remove all buttons (see crisucker2.jpg).

Once you've moved all the broadcasts you are interested in into the list on the right, click on the button labelled "retrieve". This will open up a new window, and will download all of the news articles that make up all of the broadcasts you have selected, and place them nicely in order (see crisucker3.jpg). You can then either read the text as is, or copy/paste it into another program for printing/etc.

The program has only been tested on Windows, but because Python is an interpreted language and because I'm not doing anything platform specific, it should run under Linux/Mac without any problems.

UPDATE: Now also includes the script crilatest.py, which generates an html page with an embedded media player from the most recent broadcasts. Now you can read and listen in one easy step.

Usage: python crilatest.py [--num ] [--outdir

Eg.

python crilatest.py --num 5 --outdir ~/cri
Downloads the 5 most recent broadcasts, and creates the html files in the ~/cri directory

Typing

python crilatest.py

Will download the most recent broadcast into the current directory.
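As a rough sketch of how the two options described above could be parsed (this is not the code in crilatest.py, and it uses argparse, which did not exist in Python 2.5), assuming --num defaults to 1 and --outdir defaults to the current directory, as the description implies:

```python
import argparse


def parse_args(argv=None):
    """Parse --num and --outdir; the defaults follow the description above."""
    parser = argparse.ArgumentParser(prog="crilatest.py")
    # Default of 1 mirrors "will download the most recent broadcast".
    parser.add_argument("--num", type=int, default=1,
                        help="number of recent broadcasts to fetch")
    # Default of "." mirrors "into the current directory".
    parser.add_argument("--outdir", default=".",
                        help="directory for the generated html files")
    return parser.parse_args(argv)
```

parse_args(["--num", "5", "--outdir", "~/cri"]) gives an object with num set to 5 and outdir set to the string passed in; expanding the ~ is normally done by the shell before Python sees it.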

crilatest.py has only been tested under Firefox and Safari on a Mac. See below for a screenshot.

Comments/feedback/suggestions are welcome.

699_thumb.attach

700_thumb.attach

701_thumb.attach

1227_thumb.attach

crisucker.zip


It seems the way to write/say this in Chinese is "Python" :mrgreen: I just did a quick search on the web for Python documentation in Chinese, and most of it leaves the name in English rather than translating it into Chinese. There is, however, a project called Chinese Python, which provides a translated version of Python (including things like keywords, and allowing Chinese characters for identifiers!), and that goes by the name 中蟒.


Below is a sample Chinese Python program from the page listed above :-)

答 = 整数(输入("请告诉我你的年纪: "))
如 答 == 0:
    写 "别开玩笑了, 你刚出生吗 ?"
不然 答
    写 "哇! 妖怪!"
不然 答 > 200:
    写 "哇! 妖怪!"
    写 "不! 是老妖怪!"
不然 答
    写 "嘿! 小伙子"
否则:
    相差 = 答 - 1
    写 "你好, 你的年纪比中蟒大", 相差 , '岁.\n'

It's quite funny to see an if statement in Chinese :-)


Actually, I was thinking it probably wouldn't be too hard to drop in adsotrans support for something like this. Does Adsotrans have python bindings yet? If not, while I'm in my "teach myself python" frame of mind, I could probably have a look at that if you were interested.


Sure -- although the easiest way to use the engine is just to use the web as your interface. I'm actually wondering more if the recording can be automatically fetched as well. This would make it possible not only to annotate the text but also automate the creation of RSS feeds with the latest audio data.

Nice idea anyway.


  • 4 weeks later...

Here is an idea that I think would be very useful, and an easy way to give adso more exposure. I am not sure this is what imron was talking about when he inquired about python bindings for adsotrans.... (I don't have an ime on this computer sorry)

Since one can get to the adso engine from the web, what I and I bet other python programmers would find interesting is an adso python module that would work something like this...

    import adso
    chinesetext = ....  # I guess this would be a utf8 string
    adso_result = adso.adsotate(chinesetext)

adso_result would be a list of tuples giving the familiar mouse-over parsing, in the order the Chinese appeared in the text, e.g.

adso_result = [(chinese character string, associated mouseover english, other data like part of speech etc.), (next chinese word in the text, ...) ...]

Basically, the proposed adso module would access the web with the usual Python modules and submit the chinesetext to the adso engine. The engine would return its result over the same connection in some format, and Python would parse that output and assemble the list of tuples. In other words, the proposed adso module could be written in Python; the only trick is having the adso engine accept and reply to such queries over the web.
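To make the proposal concrete, here is a minimal sketch of such a module in current Python. Everything about the server side is an assumption: the URL, the "text" form field, and the tab-separated reply (one word per line: Chinese, English, part of speech) are all hypothetical, since the thread does not say the adso engine offers any such interface.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, for illustration only.
ADSO_URL = "http://adso.example/annotate"


def parse_adso_response(body):
    """Turn the assumed tab-separated reply into a list of 3-tuples."""
    result = []
    for line in body.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        # Pad short lines so every tuple is (chinese, english, part_of_speech).
        fields += [""] * (3 - len(fields))
        result.append(tuple(fields[:3]))
    return result


def adsotate(chinesetext, url=ADSO_URL):
    """POST the text to the (hypothetical) service and parse the reply."""
    data = urlencode({"text": chinesetext}).encode("utf-8")
    with urlopen(url, data) as response:
        return parse_adso_response(response.read().decode("utf-8"))
```

Keeping the parsing in its own function means the tuple format can be tried out on a saved reply without touching the network.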

imron, is this what you had in mind?

I haven't actually had a chance to look at imron's code so I'm not clear about manipulating chinese character strings in python, but given that imron wrote this as his "helloworld2.py" -- can't be that tough right?


Initially I was thinking more along the lines of having the adso engine available as, say, a dll + database that you could then access locally via a Python wrapper, instead of dealing with the lower-level C++ stuff. However, your idea sounds good too :mrgreen: perhaps a helloworld3.py is in order :mrgreen:

Trevelyan>Is there a way to access the results of an adsotrans query remotely, without having to parse the resulting HTML page?


imron -- regarding the local copy...

If you had an adso mirror site running (hint hint), and you were running the Python adso module (or helloworld3.py if you prefer) on large texts, then presumably it would be about as fast as a dll, since the network stuff would be small in comparison, especially if you could do it over a LAN. But then we'd also all benefit from a mirror site in case of whatever. I'm a bit out of my depth here, just guessing.


Hi,

I am also a Python developer, and it's good to see more people using it for doing stuff with Chinese. The built-in Unicode support makes it especially useful. All of the work I've done on the Ocrat mirror project uses Python, and the crisucker program is actually pretty similar to the VOA news section of the Ocrat site.

That was the other thing I was going to try next. It shouldn't be too difficult. Partial documentation regarding the mms:// protocol can be found on this website: http://sdp.ppona.com/.

It actually is pretty easy (if you don't reinvent the solution) :) Using wgetpro:

steve@leoi ~/wpro
$ wpro mms://media.chinaradio.cn/chinese/pth/xw/2006111020z.wma
--15:55:17--  mms://media.chinaradio.cn/chinese/pth/xw/2006111020z.wma
          => `2006111020z.wma'
Resolving media.chinaradio.cn... 210.51.185.30
Connecting to media.chinaradio.cn[210.51.185.30]:1755... connected.
MMS request sent, awaiting response... OK
Length: 2,216,819 [Microsoft ASF]
100%[=================================================>] 2,216,819      3.99K/s    ETA 00:00

16:04:22 (3.97 KB/s) - `2006111020z.wma' saved [2216819/2216819]

The good thing about this is that it can be automated using an os.system / os.popen call in Python (or even a shell script). Dreadfully slow from outside of China though :( You could also use something like page2rss to monitor changes and start the script when new data comes in.
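A minimal sketch of that automation, assuming wpro is installed and on the PATH and is invoked exactly as in the session above (wpro followed by the mms:// URL). It uses subprocess rather than os.system / os.popen, and the function names are made up for illustration.

```python
import subprocess


def build_wpro_command(mms_url):
    """Return the argument list for the wpro invocation shown above."""
    if not mms_url.startswith("mms://"):
        raise ValueError("expected an mms:// URL, got %r" % mms_url)
    return ["wpro", mms_url]


def fetch_recording(mms_url):
    """Run wpro and return its exit status (assumes wpro is on the PATH)."""
    return subprocess.call(build_wpro_command(mms_url))
```

Passing the command as a list avoids going through the shell, so odd characters in a URL cannot be misread as shell syntax.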


It actually is pretty easy (if you don't reinvent the solution) Using wgetpro:

Well, I haven't started on this yet, but I probably would have reinvented the solution, for 2 reasons. Firstly, I was doing this to learn Python more than anything else, so reinventing wheels is a good way to play around with stuff and get my hands dirty :) Secondly, I was thinking of doing it as a pure Python solution so it'd work well for anyone else, on any platform, without needing to download/compile various other 3rd party packages, which makes things easier for the less technically inclined (this was also the reason I used tk instead of something like wxWidgets, because it's already there once Python is installed).


Firstly, I was doing this to learn python more than anything else so reinventing wheels is a good way to play around with stuff and get hands dirty secondly I was thinking of doing it as a pure python solution so it'd work well for anyone else, on any platform, without needing to download/compile various other 3rd party packages

Well wgetpro is released under the GPL, so you can look at how they do it (mms.c and mms.h) in their source if you want to rewrite a similar routine in Python...

I made an attempt to convert the CRI sucker program to a command line app (runs on 2.4) that generates static HTML pages. From there it was pretty easy to get annotation working with adso.py, you can see it online here:

A page with annotated titles (like Adsotate)

Main page (list of articles)

The annotation is kinda flaky at the moment (I should have used the official Adso javascript :D)

I put the code online here.


@skryskalla -- your VOA site is really impressive and useful. The ability to click to stream recordings while reading text is great. At just 10-11 articles per day, if it is useful you should feel free to submit the news texts to the Adso engine for processing as well as the titles. That volume won't place much load on the server.

@kudra -- I'm happy to support people who are interested in setting up mirror sites. Anyone looking to do this sort of thing should just drop a line.


  • 7 months later...

imron:

Eight months later... I recently found the China Radio International website and I have started listening to their streaming news broadcasts with great enthusiasm. Just as you described in your first post, the process of copying/pasting all the individual articles related to one broadcast into one document is a real pain. Your small script seems absolutely perfect!

Unfortunately, after following your instructions, it does not work for me. When I run the file crisucker.py, it does open the CRI sucker window, but after a few seconds of "Loading Links", it informs me either that there was an "Error loading links" or that it is "Done". In the end, no links are shown in the window.

I am running Windows XP and I have tried to run the CRI sucker under version 2.5 and under version 2.5.1 of Python. (BTW, I understand nothing about programming and I had never heard of Python before...)

Do you have any idea of what I might be doing wrong?

Thank you!


Yep, actually it had to do with how I was parsing the links, which was based on the date. So the code posted here works for 2006, but not 2007. Luckily it's an easy fix, just open up the file linkloader.py in something like notepad, and somewhere at the beginning there will be a line that says:

dateRE = re.compile( r'.*(2006.*)' )

Just change this to:

dateRE = re.compile( r'.*(2007.*)' )

And everything should work fine :)
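To avoid editing the year by hand every January, the pattern could be built from the current date instead. This is only a sketch: make_date_re is a hypothetical helper that is not in crisucker.zip, and it assumes linkloader.py uses dateRE in no other way than shown above.

```python
import datetime
import re


def make_date_re(year=None):
    """Build the same pattern as dateRE, for the given or the current year."""
    if year is None:
        year = datetime.date.today().year
    return re.compile(r'.*(%s.*)' % re.escape(str(year)))


# Drop-in replacement for the hard-coded line in linkloader.py.
dateRE = make_date_re()
```

This keeps the original matching behaviour and only swaps the hard-coded year, so it would still break if CRI changed how dates appear in its links.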


imron:

Thank you for your quick reply. I changed the date in your script and I was able to import and retrieve the articles. However, one small problem remains. The articles found on the CRI webpage are retrieved in the CRI sucker (as shown in the image crisucker3.jpg of your first post), but the last paragraph of every article is missing in the CRI sucker.

I hope that this can be fixed by another simple modification to the original code, but programming code is even more mysterious to me than Chinese characters used to be (and still are, in a large part)!

Thank you again!
