Update September 2022: Now using WordPress, with the Wordfence plugin doing much of this work. Anyhoo, in case helpful to anyone:
I’ve been having problems with high server load. My webhost figures there could be an issue with Joomla (excessive MySQL queries), or perhaps excessive spidering of the site, including by miscreant robots that aren’t Google, Inktomi or others of the more acceptable suspects. I’ve tried robots.txt with crawl-delays for Google, MSN etc, while disallowing a fairly long list of other spiders. But still a problem.
Now using a script written by members of Webmasterworld: initially by Xicus, later modified a little by Giacomo, and since modified and tweaked by AlexK, with input from various members. [I’ve since stopped using it, after a move to another webhost, where I had a bit of trouble resulting from changes to file times.]
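For anyone unfamiliar with the robots.txt approach mentioned above, here is a rough sketch of what such a file can look like. The crawler name ExampleBot is a placeholder rather than a real spider, and the delay value is arbitrary. Crawl-delay is a non-standard directive that only some crawlers honour (Googlebot is documented as ignoring it), and badly behaved robots tend to ignore robots.txt altogether, which is why it may not be enough on its own.

```text
# Ask a crawler that supports it to wait 10 seconds between requests
User-agent: msnbot
Crawl-delay: 10

# Shut out one spider by name (ExampleBot is a placeholder)
User-agent: ExampleBot
Disallow: /

# Everyone else: no restrictions
User-agent: *
Disallow:
```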
Seems to apply most readily to PHP sites, but can also be used on sites with static pages.
Should find robots that are hammering the site, maybe busily scraping content; after identifying them, should stop them.
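To give a feel for the general idea (this is not the Webmasterworld script itself, just a minimal sketch of rate-based detection), the check boils down to counting how many requests one visitor has made within a recent time window. The function name, the 10-second window and the 20-hit limit are all assumptions made up for illustration.

```php
<?php
// Hypothetical sketch: decide whether one visitor is requesting pages too fast.
// $hits holds Unix timestamps of that visitor's earlier requests.
// Names and thresholds are illustrative assumptions, not taken from the real script.

function isHammering(array $hits, int $now, int $windowSeconds = 10, int $maxHits = 20): bool
{
    // Keep only the requests that fall inside the sliding window.
    $recent = array_filter($hits, function ($t) use ($now, $windowSeconds) {
        return $t > $now - $windowSeconds && $t <= $now;
    });
    // Count the current request as well.
    return count($recent) + 1 > $maxHits;
}

// A fast scraper: 5 requests a second for the last 10 seconds.
$fast = [];
for ($s = 0; $s < 10; $s++) {
    for ($i = 0; $i < 5; $i++) {
        $fast[] = 1000 - $s;
    }
}
// A human-paced visitor: one page every 30 seconds.
$slow = [910, 940, 970];

var_dump(isHammering($fast, 1000)); // bool(true)
var_dump(isHammering($slow, 1000)); // bool(false)
```

Once a visitor trips the check, a script of this kind would typically answer with an error status (a 503, say) instead of the page. Catching slow scrapers as well would need a second, longer window with its own limit.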
The first night I tried the script, found the log file reporting the Sogou spider visiting in bursts, with ten second intervals between visits during each burst.
Added to a couple of other sites, and there too it has reported finding bad bots, both slow and fast scrapers, two or three of them making up to 5 requests a second. One of these sites is only around three months old, with few links, yet managed to attract a scraper or two: I wonder if that’s because there are links from 2 pages on Wikipedia, which itself is a hot target for scrapers. (Hadn’t occurred to me that they’d also go on to scrape sites in external links, but with hindsight it seems logical, partly as a link from Wikipedia is more likely to be useful than just some random link from wherever.)
Just maybe, the script has helped reduce server load, as my webhost hasn’t seen major issues for a few days. Fingers crossed!
Much info to wade through on Webmasterworld – spans three or four threads, including:
Blocking badly behaved runaway WebCrawlers
Blocking badly behaved bots #3
Basic instructions are simple enough; indeed, there are some instructions in the comments within the PHP file.
But it took me, a PHP dummy, quite some time to figure out just what to do. In case you want to try the script, it may be useful to draw on this info from a post by Giacomo:
1. Copy the above code, including the <?php and ?> delimiters, paste it into an empty text file and save it to your web site’s document root folder (normally, public_html/) as “botblocker.php” (or any other name you like).
2. Create a new directory inside the public_html folder and name it "botblocker"; make sure the directory is writable by the server (chmod it 777). In case you do not know how to chmod a directory, use the following script:
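// "nobody" must be quoted; it is the web server user on many hosts, yours may differ
chown($_SERVER["DOCUMENT_ROOT"]."/botblocker/", "nobody");
// 0777 makes the directory writable by everyone
chmod($_SERVER["DOCUMENT_ROOT"]."/botblocker/", 0777);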
3. Make sure you include the script at the top of all of your PHP pages, for example with a require_once of botblocker.php.
Alternatively, as suggested by DrDoc, you may add the following to your .htaccess file:
php_value auto_prepend_file "/path/to/file/botblocker.php"
The latter method requires that you replace “/path/to/file” with the actual path to your web site’s document root folder. You can easily get the path with the following PHP script:
<?php echo $_SERVER["DOCUMENT_ROOT"];?>
That’s all. 😉
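Step 3 can be sketched roughly as below, with a small sanity check added for when nothing seems to happen. The helper botblockerProblems() is a hypothetical name of my own, not part of the Webmasterworld script, and it assumes the file and folder names used in the steps above (botblocker.php and botblocker/ in the document root).

```php
<?php
// Hypothetical helper: list reasons why the bot blocker might be doing nothing.
// Assumes botblocker.php and a botblocker/ folder directly under the document root.

function botblockerProblems(string $docRoot): array
{
    $problems = [];
    $script = rtrim($docRoot, '/') . '/botblocker.php';
    $dir    = rtrim($docRoot, '/') . '/botblocker';

    if (!is_file($script)) {
        $problems[] = "missing script: $script";
    }
    if (!is_dir($dir)) {
        $problems[] = "missing directory: $dir";
    } elseif (!is_writable($dir)) {
        $problems[] = "directory not writable: $dir";
    }
    return $problems;
}

// At the top of a page: include the blocker only if the setup looks sane.
$root = $_SERVER['DOCUMENT_ROOT'] ?? '';
if ($root !== '' && botblockerProblems($root) === []) {
    require_once $root . '/botblocker.php';
}
```

While setting things up, printing the returned list (for instance with print_r) shows at a glance whether the script file or the writable folder is the missing piece.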
Note: I tried “require_once” for pages on this Joomla site, by adding code to my template, but it seemed not to be working: nothing was happening in the folder the script was supposed to use. I instead added the php_value line within the “IfModule…” lines in my .htaccess, and it worked after that. This site includes some static, HTML-only pages, and it seems they’re still working fine.
If you want to use it on a site with HTML pages and no PHP support, best to hunt around the Webmasterworld forums for instructions from AlexK.
I’ve been in touch with AlexK re the script, and the plethora of info, suggesting he might write a summary of info and instructions: he said he’ll do so once he has time. Hopefully, as and when he does so, I’ll update this page with a link.
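As a rough illustration of the .htaccess route, the fragment might look like the sketch below. The module name is an assumption: it varies with the PHP version (for instance mod_php5.c or mod_php7.c), and “/path/to/file” still has to be replaced with your own document root. php_value in .htaccess only works where PHP runs as an Apache module; on CGI or FastCGI setups the equivalent is an auto_prepend_file line in a .user.ini file.

```apache
# Only applied if PHP is loaded as an Apache module (module name is an assumption)
<IfModule mod_php5.c>
    php_value auto_prepend_file "/path/to/file/botblocker.php"
</IfModule>
```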