PHP functions for determining whether a request comes from a "spider" (crawler):
Method 1:
function isCrawler() {
    // Prefer browscap data when php.ini has the browscap directive set.
    if (ini_get('browscap')) {
        $browser = get_browser(null, true);
        // The 'crawler' key may be absent, so test it with !empty().
        if (!empty($browser['crawler'])) {
            return true;
        }
    } elseif (isset($_SERVER['HTTP_USER_AGENT'])) {
        $agent = $_SERVER['HTTP_USER_AGENT'];
        // Regular expressions for known crawler User-Agent strings
        // (duplicate entries from the original list removed).
        $crawlers = array(
            "/Googlebot/", "/Yahoo! Slurp/", "/msnbot/",
            "/Mediapartners-Google/", "/Scooter/", "/Yahoo-MMCrawler/",
            "/FAST-WebCrawler/", "/FAST Enterprise Crawler/", "/grub-client-/",
            "/MSIECrawler/", "/NPBot/", "/NameProtect/i",
            "/ZyBorg/i", "/worio bot heritrix/i", "/Ask Jeeves/",
            "/libwww-perl/i", "/Gigabot/i", "/bot@bot.bot/i",
            "/SeznamBot/i",
        );
        foreach ($crawlers as $c) {
            if (preg_match($c, $agent)) {
                return true;
            }
        }
    }
    return false;
}
Method 2:
function isCrawler() {
    // A missing or empty User-Agent header is treated as "not a crawler".
    if (empty($_SERVER['HTTP_USER_AGENT'])) {
        return false;
    }
    $agent = strtolower($_SERVER['HTTP_USER_AGENT']);
    // Substrings to look for, compared case-insensitively. Entries that
    // differed only by case or were covered by a shorter entry were removed.
    // Some entries (e.g. "Java (Often spam bot)", "Perl tool") read like
    // descriptive labels and are unlikely to match a real User-Agent.
    $spiderSite = array(
        "TencentTraveler", "BaiduGame", "Googlebot", "msnbot", "Sosospider+",
        "Sogou web spider", "ia_archiver", "Yahoo! Slurp", "YoudaoBot",
        "Yahoo Slurp", "Java (Often spam bot)", "BaiDuSpider", "Voila",
        "Yandex bot", "BSpider", "twiceler", "Sogou Spider", "Speedy Spider",
        "Google AdSense", "Heritrix", "Python-urllib", "Alexa (IA Archiver)",
        "Ask", "Exabot", "Custo", "OutfoxBot/YodaoBot", "yacy", "SurveyBot",
        "legs", "lwp-trivial", "Nutch", "StackRambler",
        "The web archive (IA Archiver)", "Perl tool", "MJ12bot", "Netcraft",
        "MSIECrawler", "WGet tools", "larbin", "Fish search",
    );
    foreach ($spiderSite as $val) {
        if (strpos($agent, strtolower($val)) !== false) {
            return true;
        }
    }
    // No entry matched.
    return false;
}
// if (isCrawler()) {
//     echo "It is a spider!";
// } else {
//     echo "It is not a spider!";
// }
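Addendum:
Below are some of the more common spider identifiers. If you spot a mistake or one I have missed, leave a comment and I will add it. Thanks.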
Baidu spider: Baiduspider
Baidu Images: Baiduspider-image
Baidu WAP: Baiduspider-mobile
Baidu Video: Baiduspider-video
Baidu News: Baiduspider-news
Google spider: Googlebot
360 spider: 360Spider
SOSO spider: Sosospider
Yahoo spider: Yahoo
Youdao spider: YoudaoBot, YodaoBot
Sogou spider: Sogou News Spider, Sogou web spider, Sogou inst spider, Sogou blog, Sogou Orion spider
Bing spider: bingbot
MSN spider: msnbot, msnbot-media
Yisou spider: YisouSpider
Alexa spider: ia_archiver
Easou spider: EasouSpider
Jike spider: JikeSpider
Etao spider: EtaoSpider
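
The identifiers listed above could be used for matching roughly as in the sketch below. It is only an illustration, not part of the two methods: the function name matchSpider is hypothetical, the User-Agent is passed in as a parameter so the function can be tried without a web request, and "Sogou" is assumed to be a sufficient common prefix for all the Sogou variants. A User-Agent header is easy to forge, so a match is a hint rather than proof.

```php
<?php
// Hypothetical helper (not from the original post): returns the first
// spider identifier found in the User-Agent string, or null if none match.
function matchSpider(string $userAgent): ?string
{
    // Identifiers taken from the list above. "Baiduspider" also covers
    // Baiduspider-image/-mobile/-video/-news, "msnbot" covers msnbot-media,
    // and "Sogou" is assumed to cover every Sogou variant.
    $identifiers = array(
        'Baiduspider', 'Googlebot', '360Spider', 'Sosospider', 'Yahoo',
        'YoudaoBot', 'YodaoBot', 'Sogou', 'bingbot', 'msnbot',
        'YisouSpider', 'ia_archiver', 'EasouSpider', 'JikeSpider', 'EtaoSpider',
    );

    foreach ($identifiers as $id) {
        // stripos() is a case-insensitive substring search.
        if (stripos($userAgent, $id) !== false) {
            return $id;
        }
    }
    return null;
}
```

In a page it could be called as matchSpider(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : ''), treating any non-null result as a spider.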