Help scraping character animation images off wikipedia

August 8, 2011 at 04:38 PM

Yes, I know this is not a programming forum, but I think people will find this useful, so I'm asking here.

I'm trying to get the stroke order images off Wikipedia, scraping from this page. The project is far from complete, but at least I can freely use the images that are available. It seems I'm not the first to try this: someone already wrote a script over at Wikipedia, but it doesn't save the files with a filename matching the image. For my use, that's not good enough.

With my limited programming experience, I've already spent two days trying to fix this, but I've reached a dead end. I also tried on a Linux machine, and the character still doesn't match the image; sometimes I get a ? in the filename. The problem is on line 123.

<?php

/*
-------------------------------------------------------------------------
get-stroke-orders.php
-------------------------------------------------------------------------

Version 1.0

Contact: http://en.wikipedia.org/wiki/User_talk:WikiLaurent

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA

-------------------------------------------------------------------------

Dependency:

Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/)

-------------------------------------------------------------------------

Usage:

php get-stroke-orders.php n=<page number> t=<animation type>

<page number> - The page number (1 to 7) at http://commons.wikimedia.org/wiki/Commons:Stroke_Order_Project/Simplified_Chinese_progress
<animation type> - "bw", "red" or "order"

-------------------------------------------------------------------------

Example:

Get all the gif animations:

php get-stroke-orders.php n=1 t=order
php get-stroke-orders.php n=2 t=order
php get-stroke-orders.php n=3 t=order
php get-stroke-orders.php n=4 t=order
php get-stroke-orders.php n=5 t=order
php get-stroke-orders.php n=6 t=order
php get-stroke-orders.php n=7 t=order

-------------------------------------------------------------------------

*/

require_once "simple_html_dom.php";
set_time_limit(3600 * 10);

function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, "StrokeOrderAnimScrapper/1.0");
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}


function downloadAnimations($pageNumber, $type = "bw") {
    $listBaseUrl = "http://commons.wikimedia.org/wiki/Commons:Stroke_Order_Project/Simplified_Chinese_progress";
    //$listBaseUrl = "./Simplified_Chinese_progress.htm";
    $pageUrl = $listBaseUrl;
    if ($pageNumber > 1) $pageUrl .= "/" . $pageNumber;
    $filecount = 0;

    echo "Parsing " . $pageUrl . "\n";
    $hmlString = curl($pageUrl);
    $html = new simple_html_dom();
    $html->load($hmlString);

    $filecount = 0;
    foreach ($html->find('tr') as $tr) {
        $tdIndex = 3;
        if ($type == "red") $tdIndex = 4;
        if ($type == "order") $tdIndex = 5;

        $tdimg = $tr->find("td", $tdIndex);
        $tdchar = $tr->find("td", 1);
        if (!$tdimg) continue;
        $img = $tdimg->find("img", 0);
        $char = $tdchar->plaintext;
        $char = substr($char, 1);
        echo "Got this char:: " . $char . "\n";
        if (!$img)
        {
            echo "no file for char::" . $char . "\n";
            continue;
        }
        $src = $img->getAttribute("src");
        if ($type == "bw" && strpos($src, "-bw.png") === false) continue;
        if ($type == "red" && strpos($src, "-red.png") === false) continue;
        if ($type == "order" && strpos($src, "-order.gif") === false) continue;

        $lastSlashIndex = strrpos($src, "/");
        $src = substr($src, 0, $lastSlashIndex);
        $src = str_replace("/thumb", "", $src);

        //$alt = $img->getAttribute("alt");
        //if ($type == "bw") $filename = substr($alt, 1) . "-bw" . ".png";
        //if ($type == "red") $filename = substr($alt, 1) . "-red" . ".png";
        //if ($type == "order") $filename = substr($alt, 1) . "-order" . ".gif";

        echo "Downloading " . $src . "\n";
        $pngData = file_get_contents($src);
        $filecount++;
        $filename = $char . substr($src, -7);
        if (empty($pngData)) echo "failed to download " . $src . "\n";
        echo "filename now:: " . $filename . "\n";
        if (file_put_contents(utf8_encode($filename), $pngData)) echo $filecount . ")) " . $char . " from src: " . $src . " downloaded" . "\n\n";
        else echo $filecount . ")) " . $char . " from src: " . $src . " failed" . "\n\n";
    }
    echo "got" . $filecount . "\n";
}


function getParam($name) {
    if (isset($_GET[$name])) return $_GET[$name];
    global $argv;
    foreach ($argv as $value) {
        $pair = explode("=", $value);
        if (count($pair) < 2) continue;
        if (trim($pair[0]) != $name) continue;
        $equalPos = strpos($value, "=");
        return trim(substr($value, $equalPos + 1, strlen($value)));
    }
    return null;
}

downloadAnimations(getParam("n"), getParam("t"));

Any help appreciated,

Thanks
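
A note on the file name logic in the script above: utf8_encode() assumes its input is ISO-8859-1, so a string that is already UTF-8 gets encoded a second time, and substr($src, -7) only fits the 7-byte suffix "-bw.png" (for "-red.png" and "-order.gif" it cuts into the suffix). The following is a minimal sketch of both points, not the original author's code. suffixFor() is a hypothetical helper, and iconv() stands in for utf8_encode() to reproduce the same conversion.

```php
<?php
// Hypothetical helper: map the script's $type argument to the file suffix.
function suffixFor($type)
{
    $suffixes = array(
        'bw'    => '-bw.png',
        'red'   => '-red.png',
        'order' => '-order.gif',
    );
    return isset($suffixes[$type]) ? $suffixes[$type] : null;
}

$char = "\xE7\x9A\x84"; // the UTF-8 bytes of 的

// Treating UTF-8 bytes as ISO-8859-1 and converting them to UTF-8 again
// (the conversion utf8_encode() performs) turns each byte into two bytes.
$twice = iconv('ISO-8859-1', 'UTF-8', $char);
echo strlen($char), " bytes before, ", strlen($twice), " bytes after\n";

// Using the string unchanged keeps the character intact in the name.
echo $char . suffixFor('order'), "\n";
```

If the character cell really is UTF-8 already, passing $filename to file_put_contents() without utf8_encode() may be all that is needed on a system with UTF-8 file names.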

August 10, 2011 at 03:59 AM

It might be easier if you just show us exactly what you want.

A link to 1 or 2 of the images you want...

And what you want it to be called when you're done.

Honestly, that code looks messy, and I'd be better off with a description of what you actually want than looking at someone else's broken script and guessing what exactly you want from it.

August 10, 2011 at 07:03 AM

Thanks for the nudge, I didn't think anyone would be interested in solving this here.

Well, I narrowed the problem down to a few lines. The problem seems to be in decoding the URL, the part with the filename to be exact.

Here's a cleaned-up bit of code:

<?php
   $src= "http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png";
   $pngData = file_get_contents($src);
   $fileName = basename(urldecode($src));
   file_put_contents($fileName, $pngData);
?>

If you put

http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png

in your browser, it would be transformed to

http://upload.wikimedia.org/wikipedia/commons/2/26/的-bw.png

and then when you save the file it would be called

的-bw.png

Simple enough, but how do I do this in PHP?
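
One possible approach, as a sketch: split the URL path at its last slash by hand and percent-decode only that last segment. It avoids basename() because the PHP manual notes that basename() is locale aware, which may be why multibyte names come out wrong under some locales. fileNameFromUrl() is a hypothetical helper name, not something from the original script.

```php
<?php
// Hypothetical helper: derive a UTF-8 file name from a percent-encoded URL.
function fileNameFromUrl($url)
{
    // Path component only, e.g. /wikipedia/commons/2/26/%E7%9A%84-bw.png
    $path = parse_url($url, PHP_URL_PATH);

    // Keep what follows the last slash, without relying on basename().
    $lastSlash = strrpos($path, '/');
    $encoded = ($lastSlash === false) ? $path : substr($path, $lastSlash + 1);

    // rawurldecode() turns %E7%9A%84 back into the raw UTF-8 bytes of 的.
    return rawurldecode($encoded);
}

$src = "http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png";
echo fileNameFromUrl($src), "\n";
```

Whether the resulting bytes show up as 的-bw.png on disk still depends on the file system and on how the PHP build handles file name encodings.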

August 10, 2011 at 07:59 AM

Wait, ignore me.

Hmmm, can't get php to create a Chinese filename. And should really be doing other things . . .

August 10, 2011 at 10:25 AM

I don't know about theirs, but....

Just use this...

getChar('你');

<?php
/*
Author: Matty of http://www.chinese-forums.com
Release: 2011-08-10
*/
ini_set('default_charset', 'utf-8'); // <=== Probably not needed!
getChar('你');
getChar('好');


function getChar($char)
{
    $folder = '';  // You could do this .... $folder = "d:/z/";
    $formattedname = iconv('utf-8', 'gbk', $char);
    if (file_exists($folder . $formattedname . '.png')) {echo "$char  - $formattedname - already downloaded.<br />\r\n"; return;}
    $x = file_get_contents('http://commons.wikimedia.org/wiki/File:' . $char . '-bw.png');
    $full_res_pattern = '<a href="(http:\/\/upload.wikimedia.org\/wikipedia\/commons\/.*?-bw.png)" class="internal" title=".*-bw.png">Full resolution';
    $file = getSingle($full_res_pattern, $x);
    $image_data = file_get_contents($file);
    file_put_contents($folder . $formattedname . '.png', $image_data);
}
function getSingle($pattern, $txt, $after = '')
{
    preg_match("/$pattern/$after", $txt, $matches);
    if (isset($matches[1]))
        return $matches[1];
    return false;
}
?>

August 14, 2011 at 08:12 PM

Sorry for the late reply, been away.. Well thanks for taking an interest in this.

I tried your script Matty, but this is what I get::

. - . - already downloaded.<br />
. - . - already downloaded.<br />

By the way, this is on a Linux machine running PHP version 5.3.5.

I'm not sure I follow your logic. Is $char supposed to come from an external list? How do I do this for all files? Directory indexing is surely disabled on Wikipedia. I'm past getting all the images; the problem is just the wrong file names.

Have you tried this on your computer? I guess results will vary with each case's severity..

August 15, 2011 at 11:14 AM

Actually I did try it, but now I'm getting slightly different results. This should work.

<?php
/*
Author: Matty of http://www.chinese-forums.com
Release: 2011-08-10
Updated: 2011-08-15
*/
ini_set('default_charset', 'utf-8'); // <=== Probably not needed!
getChar('你');
getChar('好');


function getChar($char)
{
    $folder = '';  // You could do this .... $folder = "d:/z/";
    $formattedname = iconv('utf-8', 'gbk', $char);
    if (file_exists($folder . $formattedname . '.png')) {echo "$char  - $formattedname - already downloaded.<br />\r\n"; return;}
    $x = request('http://commons.wikimedia.org/wiki/File:' . $char . '-bw.png');
    $full_res_pattern = '<a href="(http:\/\/upload.wikimedia.org\/wikipedia\/commons\/.*?-bw.png)" class="internal" title=".*-bw.png">Full resolution';
    $file = getSingle($full_res_pattern, $x);
    $image_data = file_get_contents($file);
    file_put_contents($folder . $formattedname . '.png', $image_data);
}
function getSingle($pattern, $txt, $after = '')
{
    preg_match("/$pattern/$after", $txt, $matches);
    if (isset($matches[1]))
        return $matches[1];
    return false;
}
function request($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
?>
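
A side note on the iconv('utf-8', 'gbk', ...) call in the script above: converting the name to GBK presumably targets a Windows machine whose ANSI code page is GBK (the "d:/z/" example suggests Windows). On Linux, file names are plain byte strings, so the UTF-8 name can usually be used unchanged, and as far as I know PHP 7.1 and later also accept UTF-8 paths on Windows. The sketch below picks the encoding accordingly. fileSystemName() is a hypothetical helper, and the GBK code page is an assumption.

```php
<?php
// Hypothetical helper: choose the byte encoding to use for a file name.
function fileSystemName($utf8Name)
{
    $isWindows = strtoupper(substr(PHP_OS, 0, 3)) === 'WIN';
    $oldPhp = version_compare(PHP_VERSION, '7.1.0', '<');

    if ($isWindows && $oldPhp) {
        // Assumption: the Windows ANSI code page is GBK (Chinese locale).
        $converted = iconv('UTF-8', 'GBK', $utf8Name);
        // Fall back to the original name if the conversion fails.
        return ($converted === false) ? $utf8Name : $converted;
    }

    // Everywhere else, use the UTF-8 bytes unchanged.
    return $utf8Name;
}

// "\xE4\xBD\xA0" is 你 in UTF-8.
echo fileSystemName("\xE4\xBD\xA0-bw.png"), "\n";
```

In getChar() this would replace the unconditional iconv() call, so the same script could run on both kinds of machine.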

August 15, 2011 at 05:39 PM

AHHH !! Script file has to be saved with utf8 encoding.. That's why I kept getting

Notice: iconv(): Detected an illegal character in input string in D:\simplehtmldom\matty.php on line 15

Ok, thanks, I'm getting files with the correct file name now. But how to use this to get files for all characters? List the characters in an array?

$chars = array('的','是','不','我'); // etc.. 

foreach($chars as $xchar)
{
   getChar($xchar);
}

Well, I tried that, but somehow I'm not able to statically define the array.
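
It is not clear from the post why the array failed. One alternative, sketched here, is to keep the characters in a single string and split it with the PCRE "u" (UTF-8) modifier, which yields one array element per character instead of per byte. This assumes the script file itself is saved as UTF-8, as noted above; $list is just example data.

```php
<?php
// Assumes this file is saved as UTF-8, so the literal below is UTF-8 too.
$list = '的是不我';

// An empty pattern with the "u" modifier splits between UTF-8 characters.
$chars = preg_split('//u', $list, -1, PREG_SPLIT_NO_EMPTY);

foreach ($chars as $char) {
    // In the thread's script this is where getChar($char) would be called.
    echo $char, "\n";
}
```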

August 15, 2011 at 06:08 PM

Err, here's a smarter way to use your script, but I'm too sleepy to figure out what's wrong. Almost there. Sleep now. Thanks.

require_once "simple_html_dom.php";

$listBaseUrl = "http://commons.wikimedia.org/wiki/Commons:Stroke_Order_Project/Simplified_Chinese_progress";
$hmlString = request($listBaseUrl);
$html = new simple_html_dom();
$html->load($hmlString);

foreach ($html->find('tr') as $tr)
{
    $tdchar = $tr->find("td", 1);
    if (!$tdchar) continue; // skip rows that have no second cell
    $char = $tdchar->plaintext;
    $char = substr($char, 1);
    echo $char . "\n";
    getChar($char);
}
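
One thing worth checking in the loop above is substr($char, 1). substr() counts bytes, not characters, so it only gives the intended result if the cell text starts with exactly one single-byte character such as a space. If the cell holds just the character, the first of its three UTF-8 bytes is cut off, which could explain a ? in the file name. A minimal sketch of the difference follows; the two sample strings are hypothetical cell contents, not data taken from the page.

```php
<?php
$withSpace = " \xE7\x9A\x84"; // hypothetical cell text: a space, then 的
$bare      = "\xE7\x9A\x84";  // hypothetical cell text: just 的

// With a leading space, dropping one byte happens to leave the character.
var_dump(substr($withSpace, 1) === "\xE7\x9A\x84");

// Without it, the same call leaves two stray bytes that are not valid UTF-8.
var_dump(substr($bare, 1) === "\x9A\x84");

// trim() removes surrounding whitespace only, so both inputs give 的.
var_dump(trim($withSpace) === "\xE7\x9A\x84" && trim($bare) === "\xE7\x9A\x84");
```

Swapping substr($char, 1) for trim($char) in the loop would be the corresponding change, assuming the unwanted leading character is ordinary whitespace.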

Posted by (in thread order): slabo, Matty, slabo, roddy, Matty, slabo, Matty, slabo, slabo