以下是服务器反爬虫攻略:Apache/Nginx/PHP禁止某些User Agent抓取网站,希望对大家有所帮助。
一、Apache
①、通过修改 .htaccess 文件
修改网站目录下的.htaccess,添加如下代码即可(2 种代码任选):
可用代码 (1):
RewriteEngine On RewriteCond %{HTTP_USER_AGENT} (^$|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python–urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms) [NC] RewriteRule ^(.*)$ – [F]可用代码 (2):
SetEnvIfNoCase ^User–Agent$ .*(FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python–urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms) BADBOT Order Allow,Deny Allow fromall Deny from env=BADBOT②、通过修改 httpd.conf 配置文件
找到如下类似位置,根据以下代码 新增 / 修改,然后重启 Apache 即可:
Shell
DocumentRoot /home/wwwroot/xxx <Directory “/home/wwwroot/xxx”> SetEnvIfNoCase User–Agent “.*(FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)” BADBOT Order allow,deny Allow fromall deny from env=BADBOT </Directory>
二、Nginx 代码
进入到 nginx 安装目录下的 conf 目录,将如下代码保存为 agent_deny.conf
cd /usr/local/nginx/conf vim agent_deny.conf #禁止Scrapy等工具的抓取 if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) { return 403; } #禁止指定UA及UA为空的访问 if ($http_user_agent ~* “FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$” ) { return 403; } #禁止非GET|HEAD|POST方式的抓取 if ($request_method !~ ^(GET|HEAD|POST)$) { return 403; }然后,在网站相关配置中的 location / { 之后插入如下代码:
Shell
include agent_deny.conf;如下的配置:
Shell
[marsge@Mars_Server ~]$ cat /usr/local/nginx/conf/zhangge.conf location / { try_files $uri $uri/ /index.php?$args; #这个位置新增1行: include agent_deny.conf; rewrite ^/sitemap_360_sp.txt$ /sitemap_360_sp.php last; rewrite ^/sitemap_baidu_sp.xml$ /sitemap_baidu_sp.php last; rewrite ^/sitemap_m.xml$ /sitemap_m.php last;保存后,执行如下命令,平滑重启 nginx 即可:
Shell
/usr/local/nginx/sbin/nginx –s reload三、PHP 代码
将如下方法放到贴到网站入口文件 index.php 中的第一个