Chinese-forums.com
Learn Chinese in China

imron

Allow Duckduckgo spider

imron

@roddy can you allow the duckduckgo spider in robots.txt?

 

Google search is becoming more and more unbearable, and while I switched away from it as my primary search engine a while back, it's still the only major off-site search engine that appears to have access to the site.

 

Normally I'd use Duckduckgo, but when I do a site specific search, the results tell me that its spider has been disallowed so it can't find much content.


roddy

Don’t think it’s specifically banned in robots.txt, but it might have picked up an htaccess user agent / IP ban over the years, will take a look. 

889
robots.txt:

User-agent: *
Disallow: /admin
Disallow: /profile
Disallow: /applications/core/interface/file/
Disallow: /notifications/options/
Disallow: /followed/
Disallow: /discover/followed-content/


User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
User-agent: Sogou web spider
User-agent: MJ12bot
User-agent: dotbot
User-agent: Exabot
User-agent: Wordpress/MU
User-agent: msrbot
User-agent: VB Project
User-agent: NaverBot
User-agent: Yeti
User-agent: moget
User-agent: ichiro
User-agent: Yandex
User-Agent: Charlotte
User-Agent: YoudaoBot
User-agent: sogou spider
User-Agent: bingbot
Disallow: /

https://help.duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/

 

The specific message I get is, "We would like to show you a description here but the site won't allow us."

 

 

imron
1 hour ago, 889 said:

The specific message I get is, "We would like to show you a description here but the site won't allow us."

Yep that's the one I get too.  I figured it was from robots.txt, but that robots.txt doesn't look like it blocks it.
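
One quick way to check this is to feed the rules quoted above to a parser. The following is a minimal sketch using Python's standard `urllib.robotparser`; the path `/topic/1` is made up for illustration, and the parser's matching is only an approximation of how any particular crawler actually reads the file.

```python
from urllib.robotparser import RobotFileParser

# The rules as quoted earlier in this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Disallow: /profile
Disallow: /applications/core/interface/file/
Disallow: /notifications/options/
Disallow: /followed/
Disallow: /discover/followed-content/

User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
User-agent: Sogou web spider
User-agent: MJ12bot
User-agent: dotbot
User-agent: Exabot
User-agent: Wordpress/MU
User-agent: msrbot
User-agent: VB Project
User-agent: NaverBot
User-agent: Yeti
User-agent: moget
User-agent: ichiro
User-agent: Yandex
User-Agent: Charlotte
User-Agent: YoudaoBot
User-agent: sogou spider
User-Agent: bingbot
Disallow: /
"""


def allowed_agents(robots_txt, agents, path):
    """Return {agent: True/False} for whether each agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


# "/topic/1" is a hypothetical content URL, not a real path on the site.
result = allowed_agents(ROBOTS_TXT, ["DuckDuckBot", "bingbot", "Yandex"], "/topic/1")
for agent, ok in result.items():
    print(agent, "allowed" if ok else "blocked")
```

Under these assumptions the rules leave DuckDuckBot free to crawl ordinary pages while shutting out bingbot and Yandex entirely.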

roddy

Will take a look later. Having done some inefficient research while on mobile, it looks like DDG gets its search engine results via APIs to other indexes, one of which is Yandex, which is a historically badly-behaved Russian search engine. But I see similar on Yahoo, while Google looks fine. 

roddy

Ok, so...

 

Duckduckgo seems to crawl for ranking purposes, but not for indexing. For indexing, it pulls data from other sources - Bing/Yahoo (same thing now?) and Yandex. I had Bing and Yandex both blocked from waaaaaaaaaaay back. I've removed those blocks. Yandex seems to be better-behaved now. I think Bing still had access to the sitemap, so it could see urls and titles, and include them in the index, but not the content, and that's what was turning up in Duckduckgo. I've also allowed Baidu back in.

 

If I remember, once I see that's all working better I'll submit the site for a DDG !bang search. However, there's no guarantee of how quickly or how completely we get indexed. 

roddy

Descriptions are turning up on *some* bing / ddg searches, so that seems to be working. As I say though, no idea how complete and quick the process will be. 

889

"As I say though, no idea how complete and quick the process will be."

 

Don't the access logs tell you precisely what's been spidered and when?

roddy

Theoretically yes, but I haven’t looked at a raw access log for maybe a decade. And there’s likely a delay between spidering and inclusion in the index, and I don’t know if DDG has real-time access to that index, and a search engine looking at a page doesn’t mean it makes it into the index, so...
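
For anyone who does want to look, a rough sketch of tallying crawler visits from an access log might look like this. It assumes the common "combined" log format used by Apache and nginx; the sample lines, IP addresses and paths are invented for illustration and are not taken from this site's logs. Matching on the user-agent string alone can also be fooled by spoofed agents, so treat the counts as approximate.

```python
import re
from collections import Counter

# Matches the tail of a "combined" format line:
# [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Substrings to look for in the user-agent field (case-insensitive).
CRAWLER_MARKERS = ("bingbot", "DuckDuckBot", "YandexBot", "Googlebot")


def count_crawler_hits(lines, markers=CRAWLER_MARKERS):
    """Count requests per crawler marker; lines that do not parse are skipped."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue
        agent = match.group("agent").lower()
        for marker in markers:
            if marker.lower() in agent:
                hits[marker] += 1
                break
    return hits


# Made-up sample lines using documentation-only IP ranges.
SAMPLE = [
    '192.0.2.1 - - [24/Feb/2020:12:02:00 +0000] "GET /topic/1 HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '192.0.2.1 - - [24/Feb/2020:12:02:05 +0000] "GET /topic/2 HTTP/1.1" 200 4096 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '198.51.100.7 - - [24/Feb/2020:12:03:00 +0000] "GET /topic/1 HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [24/Feb/2020:12:04:00 +0000] "GET / HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/73.0"',
    'this line is not in the expected format',
]

hits = count_crawler_hits(SAMPLE)
for marker, count in hits.most_common():
    print(marker, count)
```

As noted above, a crawler fetching a page says nothing about whether or when that page reaches the index, so this only answers the "what's been spidered" half of the question.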

Jan Finster
On 2/24/2020 at 12:02 PM, roddy said:

For indexing, it pulls data from other sources - Bing/Yahoo (same thing now?) and Yandex.

Are you saying DuckDuckGo is just metacrawling other search engines to get its search results? 😲 If so, then I'll stay with Google....

 

roddy

Not metacrawling, as such, as I understand it they use a Bing API (and lots of other sources). But I've only looked at it briefly. 

imron
1 hour ago, Jan Finster said:

If so, then I stay with google....

It's not an "either/or" situation, it's "support both".  If you still use google, enabling Bing/Yahoo/DDG searches won't affect you in any way.  It will however make a big difference to people who don't use google search.

 

Personally, I can't stand the new look of the Google search results page, and that was the driver for switching almost all my searches to DDG. Previously I was about 60/40, with DDG being 60. Now it's like 95/5.

roddy

That's about 30 visits a day from Bing-bot search engines now, up from 3 at the start of the year. Still in the region of 1%-2% of search engine traffic, but all to the good. Thanks for raising it. I've submitted for a !bang search, but not sure if it'll get approved or not. 

roddy

Although... anyone using non-Google should bear in mind that Bing et al don't have so much indexed. Bing's webmaster tools are showing me 6k-8k pages indexed depending on what date in the last six months you pick (and no real upward trend). Google reports 50k pages indexed.

 

What is going up is the number of people clicking through from Bing. Hopefully as that continues it'll lead to more indexing.

 

In other spider news, Huawei seems to be desperately gorging itself on our pages as it gears up for a world without Google. Generally, it isn't making itself any friends. But our server is humming along quite nicely, it seems, so let it gorge.

imron
2 hours ago, roddy said:

anyone using non-Google should bear in mind that Bing et al don't have so much indexed.

This matches with my experience.  There are some well known posts/threads of mine and others on here that I can find in Google with a few choice keywords that DDG fails to pick up on (both searches limited to site:chinese-forums.com).  It's getting better, but Google often still wins out for a site search of specific content.

