PHP functions for detecting whether a request comes from a "spider" (crawler):
Method 1:
function isCrawler() {
    // Prefer browscap data when the browscap directive is set in php.ini.
    if (ini_get('browscap')) {
        $browser = get_browser(null, true);
        if (!empty($browser['crawler'])) {
            return true;
        }
    } else if (isset($_SERVER['HTTP_USER_AGENT'])) {
        // Fallback: match the User-Agent against known crawler patterns.
        $agent = $_SERVER['HTTP_USER_AGENT'];
        $crawlers = array(
            "/Googlebot/",
            "/Yahoo! Slurp/",
            "/msnbot/",
            "/Mediapartners-Google/",
            "/Scooter/",
            "/Yahoo-MMCrawler/",
            "/FAST-WebCrawler/",
            "/FAST Enterprise Crawler/",
            "/grub-client-/",
            "/MSIECrawler/",
            "/NPBot/",
            "/NameProtect/i",
            "/ZyBorg/i",
            "/worio bot heritrix/i",
            "/Ask Jeeves/",
            "/libwww-perl/i",
            "/Gigabot/i",
            "/bot@bot.bot/i",
            "/SeznamBot/i",
        );
        foreach ($crawlers as $c) {
            if (preg_match($c, $agent)) {
                return true;
            }
        }
    }
    return false;
}
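The fallback branch above runs one preg_match() call per pattern. As a minimal sketch, the same idea can be expressed with a single regular expression built from a list of plain tokens. The function name matchesCrawlerPattern and its parameters are hypothetical and not part of the original code, the token list is only a subset of the one above, and this variant matches every token case-insensitively, which is slightly looser than method 1.

```php
<?php
// Hypothetical helper (an assumption for illustration): checks a User-Agent
// string against a list of crawler tokens using one combined regex.
function matchesCrawlerPattern(string $agent, array $tokens): bool {
    if ($agent === '' || count($tokens) === 0) {
        return false;
    }
    // Escape each token so characters such as "!" or "." are matched literally.
    $escaped = array_map(function ($token) {
        return preg_quote($token, '/');
    }, $tokens);
    // Join the tokens into one alternation; the "i" flag ignores case.
    $pattern = '/' . implode('|', $escaped) . '/i';
    return preg_match($pattern, $agent) === 1;
}

$tokens = array('Googlebot', 'Yahoo! Slurp', 'msnbot', 'libwww-perl', 'SeznamBot');
var_dump(matchesCrawlerPattern('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', $tokens));
var_dump(matchesCrawlerPattern('Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/120.0', $tokens));
```

Passing the User-Agent in as an argument, rather than reading $_SERVER inside the function, also makes the check easy to try from the command line.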
Method 2:
function isCrawler() {
    // Guard against a missing header, and do not echo the value.
    if (empty($_SERVER['HTTP_USER_AGENT'])) {
        return false;
    }
    $agent = strtolower($_SERVER['HTTP_USER_AGENT']);
    // Entries are compared as lowercase substrings of the User-Agent.
    // Descriptive entries such as "Java (Often spam bot)" only match a
    // User-Agent that contains exactly that text.
    $spiderSite = array(
        "TencentTraveler",
        "Baiduspider",
        "BaiduGame",
        "Googlebot",
        "msnbot",
        "Sosospider",
        "Sogou web spider",
        "ia_archiver",
        "Yahoo! Slurp",
        "YoudaoBot",
        "Yahoo Slurp",
        "Java (Often spam bot)",
        "Voila",
        "Yandex bot",
        "BSpider",
        "twiceler",
        "Sogou Spider",
        "Speedy Spider",
        "Google AdSense",
        "Heritrix",
        "Python-urllib",
        "Alexa (IA Archiver)",
        "Ask",
        "Exabot",
        "Custo",
        "OutfoxBot/YodaoBot",
        "yacy",
        "SurveyBot",
        "legs",
        "lwp-trivial",
        "Nutch",
        "StackRambler",
        "The web archive (IA Archiver)",
        "Perl tool",
        "MJ12bot",
        "Netcraft",
        "MSIECrawler",
        "WGet tools",
        "larbin",
        "Fish search",
    );
    foreach ($spiderSite as $val) {
        $str = strtolower($val);
        if (strpos($agent, $str) !== false) {
            return true;
        }
    }
    // No entry matched: return false explicitly instead of falling through.
    return false;
}
// if (isCrawler()) {
//     echo "It is a spider!";
// }
// else {
//     echo "It is not a spider!";
// }
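Both functions read $_SERVER directly, so they are awkward to exercise outside a web request. Below is a minimal sketch of method 2 that takes the User-Agent as an optional argument. The name isCrawlerAgent is hypothetical, the list is shortened for illustration, and the nullable type hint assumes PHP 7.1 or later.

```php
<?php
// Hypothetical variant of method 2 (an assumption for illustration): the
// User-Agent is an argument, defaulting to the current request's header.
function isCrawlerAgent(?string $agent = null): bool {
    if ($agent === null) {
        $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    }
    if ($agent === '') {
        return false;
    }
    // Shortened list; extend it with the identifiers you need.
    $spiders = array('Baiduspider', 'Googlebot', 'msnbot', 'Sogou web spider', 'YoudaoBot');
    foreach ($spiders as $spider) {
        // stripos() is case-insensitive, so no strtolower() calls are needed.
        if (stripos($agent, $spider) !== false) {
            return true;
        }
    }
    return false;
}

if (isCrawlerAgent('Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)')) {
    echo "It is a spider!\n";
} else {
    echo "It is not a spider!\n";
}
```

Called without an argument inside a web request, the function falls back to the request's own User-Agent header.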
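Supplement:
Some of the more common spider identifiers are listed below. If any are wrong or missing, leave a comment and I will add them. Thanks.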
Baidu spider: Baiduspider
Baidu Images: Baiduspider-image
Baidu WAP: Baiduspider-mobile
Baidu Video: Baiduspider-video
Baidu News: Baiduspider-news
Google spider: Googlebot
360 spider: 360Spider
SOSO spider: Sosospider
Yahoo spider: Yahoo
Youdao spider: YoudaoBot, YodaoBot
Sogou spider: Sogou News Spider, Sogou web spider, Sogou inst spider, Sogou blog, Sogou Orion spider
Bing spider: bingbot
MSN spider: msnbot, msnbot-media
Yisou spider: YisouSpider
Alexa spider: ia_archiver
Easou spider: EasouSpider
Jike spider: JikeSpider
Etao spider: EtaoSpider
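The identifiers above can also be used to report which spider is visiting, not only whether one is. The following is a rough sketch under stated assumptions: the function name detectSpiderName and the English labels are hypothetical, and only part of the list is included. More specific identifiers are placed before shorter ones, because a plain substring check for "Baiduspider" would otherwise also match "Baiduspider-image".

```php
<?php
// Hypothetical helper (an assumption for illustration): returns a label for
// the first identifier found in the User-Agent, or null when none matches.
function detectSpiderName(string $agent): ?string {
    // Identifier => label. Specific identifiers come before general ones.
    $spiders = array(
        'Baiduspider-image'  => 'Baidu Images',
        'Baiduspider-mobile' => 'Baidu WAP',
        'Baiduspider-video'  => 'Baidu Video',
        'Baiduspider-news'   => 'Baidu News',
        'Baiduspider'        => 'Baidu',
        'Googlebot'          => 'Google',
        '360Spider'          => '360',
        'Sosospider'         => 'SOSO',
        'YoudaoBot'          => 'Youdao',
        'YodaoBot'           => 'Youdao',
        'bingbot'            => 'Bing',
        'msnbot-media'       => 'MSN Media',
        'msnbot'             => 'MSN',
        'YisouSpider'        => 'Yisou',
        'ia_archiver'        => 'Alexa',
        'EasouSpider'        => 'Easou',
        'JikeSpider'         => 'Jike',
        'EtaoSpider'         => 'Etao',
    );
    foreach ($spiders as $token => $name) {
        // Case-insensitive substring match against the User-Agent.
        if (stripos($agent, $token) !== false) {
            return $name;
        }
    }
    return null;
}

var_dump(detectSpiderName('Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)'));
var_dump(detectSpiderName('Baiduspider-image+(+http://www.baidu.com/search/spider.htm)'));
```

Keeping the identifiers in one ordered array means a new spider can be added by inserting a single line, as long as longer variants stay above their shorter prefixes.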