I'm having trouble blocking two bad bots that keep sucking bandwidth from my site and I'm certain it has something to do with the * in the user-agent name that they use.
Right now, I'm using the following code to block the bad bots (this is an excerpt)...
# block bad bots
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^spider$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^robot$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^crawl$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^discovery$
RewriteRule .* - [F,L]
When I try to do RewriteCond %{HTTP_USER_AGENT} ^*bot$ [OR] or RewriteCond %{HTTP_USER_AGENT} ^(*bot)$ [OR] I get an error.
Guessing there is a pretty easy way to do this that I just haven't found yet on Google.
An asterisk (*) in a regular expression needs to be escaped if you want to match it literally; unescaped, it is a quantifier ("zero or more of the preceding token"), and ^* fails because there is nothing before it to repeat.
RewriteCond %{HTTP_USER_AGENT} ^\*bot$
should do the trick.
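In context, the escaped pattern slots into the existing block like this (a sketch using the conditions from the question; keep [OR] on every line except the last):

```apache
# block bad bots, including one whose user-agent is literally "*bot"
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^spider$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^robot$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^\*bot$
RewriteRule .* - [F,L]
```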
I think you are missing a dot (.); change your condition to this:
RewriteCond %{HTTP_USER_AGENT} ^.*bot$ [OR]
But how is this going to prevent Bad Bot access?
I work for a security company (also PM at Botopedia.org), and I can tell you that 99.9% of bad bots will not use any of these expressions in their user-agent string.
Most of the time, bad bots will use legitimate-looking user-agents (impersonating browsers and VIP bots like Googlebot), and you simply cannot filter them via user-agent data alone.
For effective bot detection you should look into other signs, like:
1) Suspicious signatures (e.g. the order of header parameters)
and/or
2) Suspicious behavior (e.g. early robots.txt access, or request rates/patterns)
Then you should use different challenges (e.g. JS, cookie, or even CAPTCHA) to verify your suspicions.
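A minimal sketch of a cookie challenge in .htaccess terms (the path and cookie name here are hypothetical; the real challenge page would set the cookie via JavaScript, which most simple bots never execute):

```apache
# Visitors without the "human" cookie are bounced to a challenge page,
# which sets human=1 via JavaScript before sending them back.
RewriteEngine On
RewriteCond %{HTTP_COOKIE} !human=1
RewriteCond %{REQUEST_URI} !^/challenge\.html$
RewriteRule ^protected/ /challenge.html [R=302,L]
```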
The problem you've described is often referred to as a "Parasitic Drag".
This is a very real and serious issue, and we actually published research about it just a couple of months ago.
(We found that on an average-sized site, 51% of visitors will be bots, 31% of them malicious.)
Honestly, I don't think you can solve this problem with a few lines of regex.
We offer our Bot filtering services for free and there are several others like us. (I can endorse good services if needed)
GL.
I would like to block some specific pages from being indexed / accessed by Google. These pages have a GET parameter in common, and I would like to redirect bots to the equivalent page without the GET parameter.
Example - page to block for crawlers:
mydomain.com/my-page/?module=aaa
Should be blocked based on the presence of module= and redirected permanently to
mydomain.com/my-page/
I know that canonical tags could spare me the trouble of doing this, but the problem is that those URLs are already in the Google index and I'd like to accelerate their removal. I added a noindex tag a month ago and I still see the results in Google search. It is also eating into my crawl budget.
What I wanted to try out is the following:
RewriteEngine on
RewriteCond %{QUERY_STRING} module=
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|Baiduspider) [NC]
Is this correct?
What should I add for the final redirection?
It's a tricky thing to do so before implementing anything I'd like to make sure it's the right thing to do.
Thanks
That would be:
RewriteEngine On
RewriteCond %{QUERY_STRING} module= [NC]
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|Baiduspider) [NC]
RewriteRule ^ %{REQUEST_URI}? [L,R=301]
The trailing ? in %{REQUEST_URI}? removes the previous query string from the redirect target.
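On Apache 2.4+ you can get the same effect with the QSD (query string discard) flag instead of the trailing ?:

```apache
RewriteEngine On
RewriteCond %{QUERY_STRING} module= [NC]
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|Baiduspider) [NC]
RewriteRule ^ %{REQUEST_URI} [QSD,L,R=301]
```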
From every example on the net, it seems this is the config to use to block referrer spam. I am still getting traffic from trafficmonetize.org. Can anyone tell me what to look for, or give me some ideas?
## SITE REFERRER BANNING
RewriteCond %{HTTP_REFERER} semalt\.com [NC,OR]
RewriteCond %{HTTP_REFERER} best-seo-offer\.com [NC,OR]
RewriteCond %{HTTP_REFERER} 100dollars-seo\.com [NC,OR]
RewriteCond %{HTTP_REFERER} buttons-for-website\.com [NC,OR]
RewriteCond %{HTTP_REFERER} buttons-for-your-website\.com [NC,OR]
RewriteCond %{HTTP_REFERER} seoanalyses\.com [NC,OR]
RewriteCond %{HTTP_REFERER} 4webmasters\.org [NC,OR]
RewriteCond %{HTTP_REFERER} trafficmonetize\.org [NC]
RewriteRule .* - [F]
I spent a week dealing with referral bots spamming sites. The first line of defense was the .htaccess file; however, some bots were still getting through and hitting my Google Analytics account.
The reason some of these bots show up is that they are in fact not actually visiting your website. They take your Google Analytics tracking code, place it in a JavaScript snippet on their own servers, and ping it, which produces fake pageviews.
The best solution that I came up with, was simply filtering them out in my Google Analytics account. Here is the Moz article that I used as a reference. Since adding the filter, the bots no longer appear in my Analytics stats.
Server-side solutions like the .htaccess file will only work for crawler spam; from your list, that is:
semalt
100dollars-seo
buttons-for-website
buttons-for-your-website
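For those crawler-spam entries, the .htaccess block can also be collapsed into a single alternation (a sketch using the same domains; behavior is identical to the one-condition-per-domain version):

```apache
## Block only the crawler spam -- ghost spam never reaches the server
RewriteCond %{HTTP_REFERER} (semalt|100dollars-seo|buttons-for-website|buttons-for-your-website)\. [NC]
RewriteRule .* - [F]
```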
Ghost spam like 4webmasters and trafficmonetize never accesses your site, so there is no point in trying to block it from the .htaccess file; it all happens within GA, so it has to be filtered out there. That is why it keeps showing up in your reports.
As for seoanalyses, I'm not sure, since I haven't seen it on any of the properties I manage, but you can check for yourself: select hostname as a second dimension. If you see a fake hostname or (not set), it is ghost spam; if it has a valid hostname, it is crawler spam. Either way you can filter it.
You can use two approaches for filtering spam: one is creating a campaign-source filter excluding the referral; a more advanced approach is to create a valid-hostname filter that will get rid of all ghost spam.
Here you can find more information about the spam and both solutions:
https://stackoverflow.com/a/28354319/3197362
https://stackoverflow.com/a/29717606/3197362
I would like to allow a robot with the user agent ECLoadToEdge/383175. Since I cannot confirm if the 6 numbers will change, I intend to use an asterisk.
May I know the difference between:
RewriteCond %{HTTP_USER_AGENT} !^ECLoadToEdge\*$
and
RewriteCond %{HTTP_USER_AGENT} !^ECLoadToEdge.*$
Would it be better to use !^ECLoadToEdge.[0-9]{6} instead of * for performance?
This rule is wrong:
RewriteCond %{HTTP_USER_AGENT} !^ECLoadToEdge\*$
since it will try to match a literal asterisk in the user-agent string.
You should just use:
RewriteCond %{HTTP_USER_AGENT} !^ECLoadToEdge
since you don't care what comes after ECLoadToEdge, and any trailing characters (including the changing digits) are then allowed.
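If you would rather keep the match tight (as in the question's [0-9]{6} idea), anchor on the slash and the digits instead; this assumes the agent string really is ECLoadToEdge/ followed by digits only:

```apache
# Allow only "ECLoadToEdge/" followed by digits; forbid everything else.
RewriteCond %{HTTP_USER_AGENT} !^ECLoadToEdge/[0-9]+$
RewriteRule .* - [F]
```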
I have built a Mobile site in a sub-domain.
I have successfully implemented the redirect 302 from:
www.domain.com to m.domain.com in htaccess.
What I'm looking to achieve now it to redirect users from:
www.domain.com/internal-page/ > 302 > m.domain.com/internal-page.html
Notice that the URL names for desktop and mobile are not the same.
The code I'm using looks like this:
# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress
# Mobile Redirect
# Verify Desktop Version Parameter
RewriteCond %{QUERY_STRING} (^|&)ViewFullSite=true(&|$)
# Set cookie and expiration
RewriteRule ^ - [CO=mredir:0:www.domain.com:60]
# Prevent looping
RewriteCond %{HTTP_HOST} !^m.domain.com$
# Define Mobile agents
RewriteCond %{HTTP_ACCEPT} "text\/vnd\.wap\.wml|application\/vnd\.wap\.xhtml\+xml" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "sony|symbian|nokia|samsung|mobile|windows ce|epoc|opera" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "mini|nitro|j2me|midp-|cldc-|netfront|mot|up\.browser|up\.link|audiovox" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "blackberry|ericsson,|panasonic|philips|sanyo|sharp|sie-" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "portalmmm|blazer|avantgo|danger|palm|series60|palmsource|pocketpc" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "smartphone|rover|ipaq|au-mic,|alcatel|ericy|vodafone\/|wap1\.|wap2\.|iPhone|android" [NC]
# Verify if not already in Mobile site
RewriteCond %{HTTP_HOST} !^m\.
# We need to read and write at the same time to set cookie
RewriteCond %{QUERY_STRING} !(^|&)ViewFullSite=true(&|$)
# Verify that we previously haven't set the cookie
RewriteCond %{HTTP_COOKIE} !^.*mredir=0.*$ [NC]
# Now redirect the users to the Mobile Homepage
RewriteRule ^$ http://m.domain.com [R]
RewriteRule $/internal-page/ http://m.domain.com/internal-page.html [R,L]
At the end, you have two RewriteRule lines which I believe should be changed to:
RewriteRule ^$ http://m.domain.com [R=302,L]
RewriteRule ^(.*?)/?$ http://m.domain.com/$1.html [R=302,L]
The ^(.*?)/?$ means: anchor at the beginning (^) and the end ($), capture everything in between ((.*?)), and make the capture lazy so that an optional trailing slash (/?) stays out of it. (A greedy (.*) would swallow the trailing slash into $1, giving you internal-page/.html; note also that in a per-directory .htaccess context the leading slash is already stripped, so the pattern never needs to match it.)
The http://m.domain.com/$1.html means that if the address is http://www.domain.com/internal-page/ then it becomes http://m.domain.com/internal-page.html.
The [R=302,L] means a 302 redirect (R=302) and the last rule (L), so no further rewrites are applied to the URL.
EDIT:
I believe that in the case of your RewriteRules the first one was redirecting to http://m.domain.com in the event that the URL was just the domain, but if there was anything else then the second rewrite was failing because it was not actually literally /internal-page/ and you needed a regex variable to put into the new URL.
EDIT (2):
To redirect to each mobile page from a specific desktop page:
RewriteRule ^foo/?$ http://m.domain.com/bar.html [R=302,L]
RewriteRule ^hello/?$ http://m.domain.com/world.html [R=302,L]
The /? means that a trailing / is optional in that position; ^ denotes the beginning and $ the end (the ^ can also be used inside a character class, as in [^\.], which means anything except a period). The leading slash is omitted because per-directory .htaccess patterns are matched against the path with the leading slash already removed.
Just put however many of those you need in a row and that should do the trick. To make sure there are no misconceptions: the first line means that http://www.domain.com/foo/ becomes http://m.domain.com/bar.html, and because the trailing slash is optional, http://www.domain.com/foo (note the trailing slash is absent) also redirects to http://m.domain.com/bar.html.
You can play with the syntax a bit to customize it, but hopefully I've pointed you in the right direction. If you need anything else, let me know, I'll do my best to assist.
I don't want to sound like a broken record or anything, but I feel that I could not, in good conscience, end this edit without pointing out that modifying the mobile site would be a much better way to do this. If it is not possible or you feel that a few static redirects are not a big deal versus modifying some pages, then I totally understand, but here are a few things for you to think about:
If the mobile site and desktop site are in separate folders, then the exact same naming scheme can be used for both, making the rewrites simpler and meaning that as new pages/content are added you will not need more Rewrite statements. (More rewrites means you have to create the new pages and then create the redirects; that's extra work and more files that require your attention.)
If the mobile site is actually hosted from the same directory as the desktop site, then renaming the files for one or the other to something like /desktop-foo/ or /d-foo/ makes it very easy to have the rewrite (redirect) go to something like /m-foo.html. You could forgo modifying the desktop pages and make /foo/ become /m-foo.html, so that all your mobile versions begin with an 'm'.
The third option that comes to mind is the most difficult and time consuming, depending on the content of the site, but it is a pretty cool one and ultimately would make the site the easiest to work on (after the initial work, of course). It is quite possible to use the same page for desktop, mobile, tablet, etc without the use of mod_rewrite or separate pages. Things like media queries in your CSS would allow you to change the look of the page depending on what the client is viewing it from. I came across a tutorial on the subject earlier which used media queries and the max-width of the screen to determine how the page should look. This would require a good bit of work now, but could save some hassle down the road as well as being an interesting learning experience if you are up to the challenge.
Again, sorry that this veered off topic at the end there, but I got the impression from your original question and your responses that you might find the alternatives interesting if you haven't already considered and dismissed them and that even if the alternatives do not interest you that you aren't going to be like some people and respond with, "Hey, $*%& you, buddy! I asked for Rewrites not all that other garbage!" I hope you take it as nothing more than what it is intended to be...helpful.
I'm trying to redirect images on my server to a url, if the user client is NOT A BOT.
So far I have:
RewriteCond %{HTTP_USER_AGENT} "Windows" [NC]
RewriteCond %{REQUEST_URI} jpg
RewriteRule ^(.*)$ http://www.myurl.com/$1 [R=301,L]
But something is wrong. Is it possible to combine these 2 conditions?
Your idea is admirable, but the logic is flawed based on real world bot behavior.
I deal with security on sites all the time, and user-agent strings are faked constantly. If you have the option to install it, I would recommend a tool like ModSecurity. It's basically an Apache-module firewall that uses configurable rule sets to deny bad patterns of access behavior. But honestly, if you are having issues with .htaccess stuff like this, ModSecurity might be too intense to pick up.
A better tack is to just prevent hot-linking via mod_rewrite tricks like this:
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain\.com/.*$ [NC]
RewriteRule \.(gif|jpg)$ http://www.mydomain.com/angryman.gif [R,L]
Then again, reading your question, I am not 100% sure what you want to achieve. Maybe mod_rewrite examples like this can give you hints on how to approach the issue? Good luck!
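For what it's worth, if you do want to combine the two conditions from your question, consecutive RewriteCond lines are ANDed by default, so simply stacking them is enough (a sketch using the myurl.com target from your snippet):

```apache
RewriteEngine on
# Both conditions must match: AND is implicit between RewriteCond lines
RewriteCond %{HTTP_USER_AGENT} Windows [NC]
RewriteCond %{REQUEST_URI} \.jpg$ [NC]
RewriteRule ^(.*)$ http://www.myurl.com/$1 [R=301,L]
```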