Blocking direct access to a URL (not a file) - .htaccess

A Drupal site is pushing international traffic over quota on my (Plesk 10.4) server, and it looks as though much of that (~250,000 visits/month) is direct access to the URL /user/register. We are already using the Botcha module to filter out spambot registrations, but that approach results in two full pages being served to each bot.
I'm thinking that a .htaccess rule which returns a 403 response to that URL unless the referer is from the site might be the way to go, but my .htaccess-fu is not strong, and I can only find examples for blocking hot-linking of images.
What do I need to add and where?
Thanks,
Richard

You'd be checking against the HTTP referer. It's not a guaranteed way to block incoming traffic linked from a site other than yours, since the field can easily be forged, but you can try adding this to the .htaccess file (above any rules that are already there):
# Forbid /user/register unless the referer is this site
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?your-domain\.com/ [NC]
RewriteRule ^user/register - [L,F]

Related

Block access for traffic to fake PDF pages

I get a lot of 404 hits on my site for PDF pages that have never existed there. These are all spammy-subject.pdf URLs. I see tens of these per day, which is much higher than genuine site traffic.
I'm currently adding 410 rewrites for each.
Can I use htaccess rule to totally block this traffic from reaching this site? Before it becomes a 404?
Can I use htaccess rule to totally block this traffic from reaching this site?
You can use .htaccess to prevent the request from being routed through a CMS such as WordPress, Joomla, etc. that uses a front-controller pattern - if that's what you mean by "site". However, the request has already reached your server by the time the .htaccess file is processed, so doing anything in .htaccess isn't necessarily going to help a "static site".
If you are already returning a 404 (or 410) - before it reaches your site - then the issue is already resolved.
The only potential issue is if the requests are being routed through your CMS and the 404 is being triggered by your CMS, not Apache. That would suggest you have the directives in the wrong place in your .htaccess file (or not present at all). Blocking directives like this need to be at the top of your .htaccess file, before any existing rewrites.
For example:
# Prevent 404 request being routed unnecessarily through CMS
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule \.pdf$ - [NC,R=404]
There's no advantage to serving a 410 Gone instead of a 404 unless these files previously existed and you are trying to remove them from search engines (or telling 3rd parties they no longer exist).
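If the files did previously exist and you do want to signal 410 instead, the same pattern works with mod_rewrite's G (gone) flag; a minimal sketch:
# Serve 410 Gone for .pdf requests that don't map to a real file
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule \.pdf$ - [NC,G]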
UPDATE:
Should this code be at the very top, or after the opening WordPress rule, RewriteEngine On?
It needs to be at the very top, before the # BEGIN WordPress comment marker (you should avoid manually editing the code in the WordPress section since WordPress itself maintains this section and your edits will be overwritten).
Yes, this is before the RewriteEngine On directive. You do not need to repeat the RewriteEngine directive. The location of the RewriteEngine directive does not actually matter. If there are multiple instances of this directive in the file then the last instance wins and controls the entire file. (It is a quick way to effectively comment out all the mod_rewrite directives in the file by simply placing a RewriteEngine Off directive at the very end.)
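For illustration, a sketch of how the top of the file might then look (the WordPress block shown is the stock one that WordPress maintains itself):
# Prevent 404 request being routed unnecessarily through CMS
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule \.pdf$ - [NC,R=404]

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress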

htaccess - Transform a GET parameter value into a subdomain

In my web application, I currently have URLs like this:
https://example.com/mypage?company=companyname&otherparameter=othervalue&...
I would like to transform the above URL this way:
https://companyname.example.com/mypage?otherparameter=othervalue&...
so basically transforming the value of the GET parameter "company" into a subdomain while preserving the other GET parameters in the URL (and preserving, obviously, also the path of the file on the server).
I also need to exclude the "/api" directory from this rule (so all files under the "/api" subdirectory should be served as usual).
I know I need to use .htaccess but I can't find a way to get it to work. If someone's got a hint, that would be very helpful.
Thanks!
This will capture the "subdomain name" from any incoming request and add it as a query parameter to the internally rewritten target:
RewriteEngine on
# Don't touch the bare or www host
RewriteCond %{HTTP_HOST} !^(?:www\.)?example\.com$
# Capture the subdomain; a negated pattern can't capture, so this condition must match
RewriteCond %{HTTP_HOST} ^([^.]+)\.example\.com$
# Leave everything under /api alone
RewriteCond %{REQUEST_URI} !^/api(/|$)
# Don't rewrite again once the company parameter is present (prevents an internal loop)
RewriteCond %{QUERY_STRING} !(^|&)company=
RewriteRule ^ %{REQUEST_URI}?company=%1 [QSA,L]
This takes care of handling incoming requests. It does not somehow magically change the references you hand out, though, such as links embedded in HTML markup or JavaScript.
You need to make sure that your HTTP server actually responds to requests for those "subdomain" based host names; a default virtual host is usually used for this. You also need to take care that DNS resolution of such names works and points towards your HTTP server. And finally you have to provide a valid SSL certificate for all those host names. A wildcard certificate is an option here, but unlike normal certificates it does not necessarily come free of charge.
It is a good idea to implement such general rules in the actual host configuration of your http server. You can use a distributed configuration file for this (".htaccess"), but that comes with a few disadvantages.
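A minimal sketch of such a catch-all virtual host; the DocumentRoot and certificate paths are assumptions, adjust them to your setup:
<VirtualHost *:443>
    # Matches example.com and every subdomain of it
    ServerName example.com
    ServerAlias *.example.com
    DocumentRoot /var/www/example
    SSLEngine on
    SSLCertificateFile /etc/ssl/certs/wildcard.example.com.pem
    SSLCertificateKeyFile /etc/ssl/private/wildcard.example.com.key
</VirtualHost>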

Too many Rewrite Rules in .htaccess

I had to redesign a site last week. The problem is that the old URLs weren't SEO friendly, so in order to avoid Google penalizing my site for too many 404 errors, I have to create a lot of RewriteRules, because all the content had awful URLs (and that content had a good position in the SERPs).
For example:
RewriteRule ^documents/documents_for_subject/22-ecuaciones-exponenciales-y-logaritmicas http://%{HTTP_HOST}/1o-bachillerato/matematicas-cc.ss/aritmetica-y-algebra/ecuaciones-exponenciales-y-logaritmicas [R=301,L]
Is this a problem on my performance? Is there another solution to my situation?
Thanks
They are in the same domain.
Then an internal redirect is much better. A header redirect sends the new URL to the browser and causes it to make a new request; an internal one is handled, as the name says, internally.
This should work:
RewriteRule ^documents/documents_for_subject/22-ecuaciones-exponenciales-y-logaritmicas /1o-bachillerato/matematicas-cc.ss/aritmetica-y-algebra/ecuaciones-exponenciales-y-logaritmicas [L]
Any performance issues are going to be negligible with this - except maybe if you have many thousands or tens of thousands of individual rules; those may slow down Apache. In that case, if you have access to the central server configuration, put the rules there instead of in a .htaccess file, because directives in the server config are read once at startup rather than on every request, and are therefore faster.
A. Yes, using a 301 is the right way to notify search bots about changed URLs, and eventually your old URLs will be removed from search results.
B. You don't need to use %{HTTP_HOST} in your rewrite rule; just use it like this:
RewriteRule ^documents/documents_for_subject/22-ecuaciones-exponenciales-y-logaritmicas /1o-bachillerato/matematicas-cc.ss/aritmetica-y-algebra/ecuaciones-exponenciales-y-logaritmicas [R=301,L]
C. If you have lots of RewriteRules like the above, I recommend using a RewriteMap, or else use some scripting support (like PHP) to redirect from old to new URLs with a 301.
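A sketch of the RewriteMap approach, assuming a map file at /etc/apache2/redirect-map.txt (note that RewriteMap itself can only be declared in the server or virtual host config, not in .htaccess; the lookup rule can then live in the .htaccess):
# In the server/vhost config: declare the map
RewriteMap redirects txt:/etc/apache2/redirect-map.txt

# redirect-map.txt holds one "old-path new-path" pair per line, keys without
# the leading slash to match the per-directory pattern below, e.g.:
# documents/documents_for_subject/22-ecuaciones-exponenciales-y-logaritmicas /1o-bachillerato/matematicas-cc.ss/aritmetica-y-algebra/ecuaciones-exponenciales-y-logaritmicas

# In the .htaccess: one rule handles every old URL
RewriteEngine On
RewriteCond ${redirects:$1|NOT_FOUND} !NOT_FOUND
RewriteRule ^(documents/.+)$ ${redirects:$1} [R=301,L]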

Block all bots/crawlers/spiders for a special directory with htaccess

I'm trying to block all bots/crawlers/spiders for a special directory. How can I do that with htaccess? I searched a little bit and found a solution by blocking based on the user agent:
RewriteCond %{HTTP_USER_AGENT} googlebot
Now I would need more user agents (for all known bots), and the rule should apply only to my separate directory. I already have a robots.txt, but not all crawlers take a look at it ... Blocking by IP address is not an option. Or are there other solutions? I know about password protection, but I have to ask first if that would be an option. Nevertheless, I'm looking for a solution based on the user agent.
You need to have mod_rewrite enabled. Place the following in a .htaccess file in that folder. If placed elsewhere (e.g. in a parent folder), then the RewriteRule pattern needs to be slightly modified to include that folder name.
RewriteEngine On
# Match common bot user agents, case-insensitively
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|Baiduspider) [NC]
# Respond 403 Forbidden to any matching request in this directory
RewriteRule .* - [R=403,L]
I have entered only a few bots; add any others yourself (letter case does not matter).
This rule will respond with a "403 Forbidden" result code to such requests. You can change to another HTTP response code if you really want, but 403 is the most appropriate here considering your requirements.
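If you would rather keep a single .htaccess at the document root, a sketch of the same thing scoped to a directory (the /private/ name here is hypothetical):
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|Baiduspider) [NC]
# Only apply inside the /private/ directory
RewriteRule ^private/ - [F]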
Why use .htaccess or mod_rewrite for a job that is specifically meant for robots.txt? Here is the robots.txt snippet you will need to block a specific set of directories:
User-agent: *
Disallow: /subdir1/
Disallow: /subdir2/
Disallow: /subdir3/
This will block all search bots in directories /subdir1/, /subdir2/ and /subdir3/.
For more explanation see here: http://www.robotstxt.org/orig.html
I know the topic is "old" but still, for people who landed here as I did: take a look at the great 5G Blacklist 2013.
It's a great help, and not only for WordPress but for all other sites too. Works awesome, IMHO.
Another one worth looking at is Linux Reviews' anti-spam through .htaccess.

URL/Subdomain rewrites (htaccess)

Say I have the following file:
http://www.example.com/images/folder/image.jpg
I want to serve it on
http://s1.example.com/folder/image.jpg
How can I do a htaccess rewrite to point it to it?
For example, I make a subdomain s1.example.com, and then on that subdomain I add a .htaccess rule that points any file request to http://www.example.com/images/.
Does serving files this way act as serving content from a cookieless domain?
First, let me talk a bit about the concept of cookieless domains. Normally, when requesting anything over HTTP, any relevant cookies are sent with the request. Which cookies are relevant depends on the domain being requested. The idea of using a cookieless domain is that you relocate static content that doesn't need cookies, like images, to a separate domain, so that no cookies are sent with those requests. This cuts out a small amount of traffic.
How much you gain from doing this depends on the type of page. The more images you have, the more you gain. If your site loads a big bunch of small images, such as avatars or image thumbnails, you might have a lot to gain. Conversely, if your site doesn't use any cookies, you have nothing to gain. It's entirely possible that your page won't load noticeably faster if it only uses a small number of images, which will be cached between page loads anyway.
One thing to keep in mind, too, is that cookies set for example.com will also be sent with requests to s1.example.com, since "s1." is a subdomain of example.com. You need to use www. (or any other subdomain of your choice) for your main site in order to separate the cookie spaces.
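To illustrate the cookie-space point, compare these two hypothetical response headers:
# Set from www.example.com with an explicit Domain attribute:
# the browser will also send this cookie to s1.example.com
Set-Cookie: session=abc123; Domain=example.com; Path=/

# Set from www.example.com with no Domain attribute:
# the cookie is scoped to www.example.com only, so s1.example.com stays cookieless
Set-Cookie: session=abc123; Path=/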
Secondly, if you decide that a cookieless domain is actually something worth trying, let's talk about the implementation.
Shikhar's solution is bad! While it appears to work on the surface, it actually defeats the purpose of using a cookieless domain. For every image, the s1. URL is tried first. The s1. URL then issues a redirect to the www. domain, which triggers a second HTTP request. This is a loss no matter how you look at it. What you need is a rewrite, which changes the URL internally on the web server, without the browser even realizing.
For simplicity, I'm assuming that all domains point to the same directory, so that www.example.com/something = example.com/something = s1.example.com/something = blub.example.com/something. This makes things simpler if you really need to store the images physically in "www.example.com/images".
I'd recommend a .htaccess that looks a little something like this:
# Turn on rewrites
RewriteEngine On
# Rewrite all requests for images from s1, so they are fetched from the right place
RewriteCond %{HTTP_HOST} ^s1\.example\.com
# Prevent an endless loop from ever happening
RewriteCond %{REQUEST_URI} !^/images
RewriteRule (.+) /images/$1 [L]
# Redirect http://s1.example.com/ to the main page (in case a user tries it)
RewriteCond %{HTTP_HOST} ^s1\.example\.com
RewriteRule ^$ http://www.example.com/ [R=301,L]
# Redirect all requests with other subdomains, or without a subdomain to www.
# Eg, blub.example.com/something -> www.example.com/something
# example.com/something -> www.example.com/something
RewriteCond %{HTTP_HOST} !^www\.example\.com
RewriteCond %{HTTP_HOST} !^s1\.example\.com
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
# Place any additional rewrites below.
Just for general info, for people who, like me, are investigating the benefits of this: from what I'm reading, it isn't just about cutting the upstream overhead of cookies sent with HTTP requests. Apparently many browsers limit concurrent connections to a single domain/server to 6. So if you have a separate domain on a different server, you get to double that to 12. To me that would seem like the main potential here for a serious speed boost.
Though anyway, if I'm understanding this correctly, the other domain serving the static content needs to be located on a different server from the main domain. That actually makes sense. As an avid Firefox user and tweaker: when you check the about:config settings in Firefox, the max connections per server is set to 6 by default. A person can manually bump it up to a max of 8, but most Firefox users probably don't spend enough time getting familiar with how to modify the browser and leave it at the default of 6.
I'm not sure what the other browsers set by default, and there are older browser versions still in use to consider. Bottom line: it makes perfect sense that enabling the browser to double the total number of connections by using two servers would improve load time. Using a subdomain on the same server, a person isn't going to be able to take advantage of that.
If you mean to redirect the traffic from s1.example.com to www.example.com, use the following .htaccess on www.example.com:
RewriteCond %{HTTP_HOST} ^(s1\.example\.com)
RewriteRule (.*) http://www.example.com%{REQUEST_URI} [R=301,NC,L]
If this is not what you are looking for, elaborate the question further.
I think you may have it backwards (or very possibly I do). To clarify: if you're implementing a cookieless subdomain and have a base URL of www., at least in this case, cookies are set on www. For example, a major cookie setter is Google Analytics, so when setting their script on my site it looks like this:
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'analytics-acc-#'],
          ['_setDomainName', 'www.valpocreative.com'],
          ['_trackPageview']);
You can see here that I set my main domain to www. Correct me if I'm wrong, but in my case I would need to redirect www to the non-www subdomain and not the other way around. This is also the CNAME setup made in my cPanel (CNAME "cdn" pointing to www.domain.com).
