The Ultimate Generic .htaccess Wildcard Subdomain Rewrite to "www." - is this valid?

I think I have achieved the holy grail of generic wildcard subdomain redirection, after going around in circles for a whole day battling "too many redirects" errors. It seems to work with any domain and subdomain; the only part you need to specify is the list of valid suffixes, e.g. .com|.com.au|.co.uk. The code takes *yourdomain.suffix for any domain and turns it into http://www.yourdomain.suffix, but only for subdomains that could actually exist. You can have as many sequences of anything.anything-anything.anything-anything-anything.anything. before yourdomain.com as you want; they all get turned into www. It seems to work perfectly, but I don't trust this sadistic language of regex one bit. I have no way of knowing whether this code is valid, whether it will cause server problems, or whether it will fail under some important circumstances. Can anyone help bug-test or refine it?
Here it is:
RewriteCond %{HTTP_HOST} ^([a-zA-Z0-9]+[a-zA-Z0-9\-]*[a-zA-Z0-9]+[\.]{1}|[a-zA-Z0-9]+[\.]{1})*([a-zA-Z0-9]+[a-zA-Z0-9\-]*[a-zA-Z0-9]+|[a-zA-Z0-9]+)\.(com|com[\.]{1}au)?$ [NC]
RewriteCond %{HTTP_HOST} !^www\.([a-zA-Z0-9]+[a-zA-Z0-9\-]*[a-zA-Z0-9]+|[a-zA-Z0-9]+)\.(com|com[\.]{1}au)?$ [NC]
RewriteRule .? http://www.%2.%3%{REQUEST_URI} [R=302,NC,L]
Note: The reason it's so long is that I'm trying to account for the possibility of dashes in the main domain or subdomain parts, as in anything-anything.yourdomain.com. But I read that in domain names a dash must have at least one alphanumeric character between it and any period. So www.anything-.yourdomain.com and www.-anything.yourdomain.com are both invalid and must be rejected. If I didn't have to consider this, the regex for the first 2 lines would be way simpler: it could just start with:
RewriteCond %{HTTP_HOST} ^([a-zA-Z0-9\-]+[\.]{1})*([a-zA-Z0-9\-]+)\.(com|com[\.]{1}au)?$
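A shorter variant, as a sketch only and not a tested drop-in. It assumes Apache 2.4 with PCRE (so non-capturing groups (?:...) are available), a Host header without a port, and that .com and .com.au are the only suffixes of interest. It also drops the trailing ? after the suffix group, which in the rules above makes the suffix optional, and it puts the negated "already www" check first so that %1 and %2 clearly come from the last condition.

```apache
RewriteEngine On

# Leave hosts alone that already are exactly www.<label>.<suffix>
RewriteCond %{HTTP_HOST} !^www\.[a-z0-9-]+\.(?:com\.au|com)$ [NC]

# A label starts and ends with an alphanumeric; dashes are only allowed inside.
# Any number of leading labels are matched but not captured.
# %1 = the label directly in front of the suffix, %2 = the suffix
RewriteCond %{HTTP_HOST} ^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)*([a-z0-9](?:[a-z0-9-]*[a-z0-9])?)\.(com\.au|com)$ [NC]

# Temporary redirect while testing; switch to 301 once it behaves
RewriteRule ^ http://www.%1.%2%{REQUEST_URI} [R=302,L]
```

With these conditions, a.b-c.example.com.au should end up at www.example.com.au, while www.example.com and hosts with any other suffix are left untouched.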

Related

.htaccess re-write rule breaks on value change

I have the following rule in my .htaccess file and it works fine
RewriteRule ^eat/?([a-zA-Z0-9-]+)?/?([a-zA-Z0-9-]+)?/?([0-9]+)?/?([a-zA-Z0-9-]+)?/?$ /incl/pages/details.php?state=$1&city=$2&ID=$3&name=$4 [NC,L]
I need to modify the rule so that the ID ($3) is a 3-character alphanumeric value (always in caps) and not a 3-digit number as it is now.
I've tried but my rule stops working:
RewriteRule ^eat/?([a-zA-Z0-9-]+)?/?([a-zA-Z0-9-]+)?/?([A-Z0-9]+)?/?([a-zA-Z0-9-]+)?/?$ /incl/pages/details.php?state=$1&city=$2&ID=$3&name=$4 [NC,L]
What am I missing?
To me this looks like you have an issue with a clear separation of the parameters. The way you use the question mark operator to implement a "lazy" rule able to rewrite a variable number of parameters will cause issues if the capture groups match the same character sets.
Instead I suggest you specify separate rules for fixed numbers of parameters. That way you don't have to rely on that operator to such an extent:
RewriteEngine on
RewriteRule ^eat/([a-zA-Z0-9-]+)/([a-zA-Z0-9-]+)/([A-Z0-9]+)/([a-zA-Z0-9-]+)/?$ /incl/pages/details.php?state=$1&city=$2&ID=$3&name=$4 [NC,L]
RewriteRule ^eat/([a-zA-Z0-9-]+)/([a-zA-Z0-9-]+)/([A-Z0-9]+)/?$ /incl/pages/details.php?state=$1&city=$2&ID=$3 [NC,L]
RewriteRule ^eat/([a-zA-Z0-9-]+)/([a-zA-Z0-9-]+)/?$ /incl/pages/details.php?state=$1&city=$2 [NC,L]
RewriteRule ^eat/([a-zA-Z0-9-]+)/?$ /incl/pages/details.php?state=$1 [NC,L]
RewriteRule ^eat/?$ /incl/pages/details.php [NC,L]
And a question: do you really need the NC flag? Is eat really written in different cases? The flag will interfere with your attempt to distinguish between groups capturing only upper-case letters and groups capturing upper- and/or lower-case letters.
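As a sketch of what the four-parameter rule could look like without NC, assuming the ID really is always exactly three upper-case letters or digits and that eat is always written in lower case:

```apache
RewriteEngine On

# No [NC]: [A-Z0-9]{3} now only accepts upper-case letters and digits,
# so a lower-case segment such as "abc" is not taken for an ID.
RewriteRule ^eat/([a-zA-Z0-9-]+)/([a-zA-Z0-9-]+)/([A-Z0-9]{3})/([a-zA-Z0-9-]+)/?$ /incl/pages/details.php?state=$1&city=$2&ID=$3&name=$4 [L]
```

The shorter rules for three, two, one and zero parameters would drop the flag in the same way.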
And a general hint: you should always prefer to place such rules inside the HTTP server's host configuration instead of using dynamic configuration files (".htaccess"). Those files are notoriously error prone, hard to debug and they really slow down the server. They are only provided as a last option for situations where you do not have control over the host configuration (read: really cheap hosting service providers) or if you have an application that relies on writing its own rewrite rules (which is an obvious security nightmare).

Renaming and redirecting pages fails in htaccess

I am sorry to ask this question, because the answer seemingly is so easy. However, after three hours of trial and error I am without a clue.
I have several pages on a website using parameters in the url. I would like to change that, to a more regular url. Example:
domain.com/pag.php?id=1-awesome-page should become domain.com/awesome-page
So far so good, but so far I have three problems.
1. The old page is still accessible, so Google will index it as duplicate content. When I try to redirect it, I get infinite loop errors.
2. For whatever reason, sometimes SOME images (straight from the content) get stripped off on the newly named page. I tried playing with a base URL and renaming the images and URLs, but nothing has worked so far.
3. Also, the redirect doesn't care whether I enter id=1-awesome-page or id=2-worthless-page. It all redirects to the first one.
Among the things I've tried:
RewriteCond %{QUERY_STRING} id=1-awesome-page
RewriteRule ^pag\.php$ /awesome-page? [L,R=301]
RewriteRule ^awesome-page?$ pag\.php?id=1 [NC]
What you want to do cannot really be done with mod_rewrite, unless you want to make a rule for every page, which will probably slow your site down quite a lot. This is because you can't summon the 1 in 1-awesome-page out of thin air, and your pag.php page doesn't seem to be able to load the page based only on its SEO name. If you need to use that number, you need to have that number somewhere in your URL.
As for your questions:
The error you mention cannot be reproduced with the current iteration of your .htaccess. You likely had an infinite loop previously, and since you use R=301 to test, the browser will cache this redirect and only request the second resource afterwards when you request the first resource. You should test with [R,L] and only change to [R=301,L] when everything works as expected. Not doing so will cause weird behaviour, and behaviour you do not expect with your .htaccess.
When you have an url a and an url b, and want to redirect a to b, and want to internally rewrite b to a, you need to make sure that any given time not both rules can be matched. You can either use the %{THE_REQUEST} trick or use the END flag. Both are outlined in this answer.
If you have a problem with resources on a page not loading after making a fancy URL, you likely used relative URLs. This question outlines the possibilities on how to resolve this. You can either make the URLs absolute or relative to the root of your site, or use <base href="/">.
The following would work for /pag.php?id=123-news-page and /news/123/news-page.
RewriteCond %{THE_REQUEST} pag\.php\?.*id=([^-]+)-([^&\s]+)
RewriteRule ^pag\.php$ /news/%1/%2? [L,R]
RewriteRule ^news/([^/]+)/([^/]+)/?$ pag.php?id=$1-$2 [L]
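For comparison, the same pair of rules using the END flag instead of the %{THE_REQUEST} trick might look like the sketch below. It assumes Apache 2.4 (END is not available in 2.2) and the same /news/ID/name URL layout as above.

```apache
RewriteEngine On

# External redirect: old query-string URL -> pretty URL.
# %1 = numeric part of id, %2 = name part; the trailing ? drops the old query string.
RewriteCond %{QUERY_STRING} (?:^|&)id=([^-&]+)-([^&]+)
RewriteRule ^pag\.php$ /news/%1/%2? [R,L]

# Internal rewrite: pretty URL -> real script.
# END stops all further rewrite processing, so the redirect rule above
# is not evaluated again for the rewritten pag.php request.
RewriteRule ^news/([^/]+)/([^/]+)/?$ pag.php?id=$1-$2 [END]
```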

Why does this cause an infinite request loop?

Earlier today, I was helping someone with an .htaccess use case, and came up with a solution that works but can't quite figure it out myself!
He wanted to be able to:
1. Browse to index.php?id=3&cat=5
2. See the location bar read index/3/5/
3. Have the content served from index.php?id=3&cat=5
The last two steps are fairly typical (usually from the user entering index/3/5 in the first place), but the first step was required because he still had some old-format links in his site and, for whatever reason, couldn't change them. So he needed to support both URL formats, and have the user always end up seeing the prettified one.
After much to-ing and fro-ing, we came up with the following .htaccess file:
RewriteEngine on
# Prevents browser looping, which does seem
# to occur in some specific scenarios. Can't
# explain the mechanics of this problem in
# detail, but there we go.
RewriteCond %{ENV:REDIRECT_STATUS} 200
RewriteRule .* - [L]
# Hard-rewrite ("[R]") to "friendly" URL.
# Needs RewriteCond to match original querystring.
# Uses "?" in target to remove original querystring,
# and "%n" backrefs to move its components.
# Target must be a full path as it's a hard-rewrite.
RewriteCond %{QUERY_STRING} ^id=(\d+)&cat=(\d+)$
RewriteRule ^index\.php$ http://example.com/index/%1/%2/? [L,R]
# Soft-rewrite from "friendly" URL to "real" URL.
# Transparent to browser.
RewriteRule ^index/(\d+)/(\d+)/$ /index.php?id=$1&cat=$2
Whilst it might seem to be a somewhat strange use case ("why not just use the proper links in the first place?", you might ask), just go with it. Regardless of the original requirement, this is the scenario and it's driving me mad.
Without the first rule, the client enters into a request loop, trying to GET /index/X/Y/ repeatedly and getting 302 each time. The check on REDIRECT_STATUS makes everything run smoothly. But I would have thought that after the final rule, no more rules would be served, the client wouldn't make any more requests (note, no [R]), and everything would be gravy.
So... why would this result in a request loop when I take out the first rule?
Without being able to tinker with your setup, I can't say for sure, but I believe this problem is due to the following relatively arcane feature of mod_rewrite:
When you manipulate a URL/filename in per-directory context mod_rewrite first rewrites the filename back to its corresponding URL (which is usually impossible, but see the RewriteBase directive below for the trick to achieve this) and then initiates a new internal sub-request with the new URL. This restarts processing of the API phases.
(source: mod_rewrite technical documentation, I highly recommend reading this)
In other words, when you use a RewriteRule in an .htaccess file, it's possible that the new, rewritten URL maps to an entirely different directory on the filesystem, in which case the .htaccess file in the original directory wouldn't apply anymore. So whenever a RewriteRule in an .htaccess file matches the request, Apache has to restart processing from scratch with the modified URL. This means, among other things, that every RewriteRule gets checked again.
In your case, what happens is that you access /index/X/Y/ from the browser. The last rule in your .htaccess file triggers, rewriting that to /index.php?id=X&cat=Y, so Apache has to create a new internal subrequest with the URL /index.php?id=X&cat=Y. That matches your earlier external redirect rule, so Apache sends a 302 response back to the browser to redirect it to /index/X/Y/. But remember, the browser never saw that internal subrequest; as far as it knows, it was already on /index/X/Y/. So it looks to you as though you're being redirected from /index/X/Y/ to that same URL, triggering an infinite loop.
Besides the performance hit, this is probably one of the better reasons that you should avoid putting rewrite rules in .htaccess files when possible. If you move these rules to the main server configuration, you won't have this problem because matches on the rules won't trigger internal subrequests. If you don't have access to the main server configuration files, one way you can get around it (EDIT: or so I thought, although it doesn't seem to work - see comments) is by adding the [NS] (no subrequest) flag to your external redirect rule,
RewriteRule ^index\.php$ http://example.com/index/%1/%2/? [L,R,NS]
Once you do that, you should no longer need the first rule that checks the REDIRECT_STATUS.
The solution below worked for me.
RewriteEngine on
RewriteBase /
#rule1
#Guard condition: only if the original client request was for index.php
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php [NC]
RewriteCond %{QUERY_STRING} ^id=(\d+)&cat=(\d+)$ [NC]
RewriteRule . /index/%1/%2/? [L,R]
#rule 2
RewriteRule ^index/(\d+)/(\d+)/$ /index.php?id=$1&cat=$2 [L,NC]
Here is what I think is happening
From the steps you quoted above
1. Browse to index.php?id=3&cat=5
2. See the location bar read index/3/5/
3. Have the content served from index.php?id=3&cat=5
At Step 1, Rule 1 matches and issues an external redirect; the location bar changes, which fulfills Step 2.
At Step 3, Rule 2 now matches and rewrites to index.php.
The rules are rerun, for the reasons David stated, but since THE_REQUEST is immutable once set to the original request, it still contains /index/3/5 so Rule 1 does not match.
Rule 2 does not match either and the result of index.php is served.
Most other variables are mutable e.g. REQUEST_URI. Their modification during rule processing, and the incorrect expectation that the pattern matches are against the original request is a common reason for infinite loops.
It feels quite esoteric sometimes, but I am sure there is a logical reason for its complexity :-)
EDIT
Surely there are two distinct requests
There are 2 client requests, the original one from Step1 and the one from the external redirect in step 2.
What I glossed over above is that when Rule 2 matches on the second request, it is rewritten to /index.php and causes an internal redirect. This forces the .htaccess file for the / directory to be loaded again (it could easily have been another directory with different .htaccess rules) and all the rules to be re-run.
So... why would this result in a request loop when I take out the first rule?
When the rules are re-run, the first rule now unexpectedly matches, as a result of Rule2's rewrite, and does a redirect, causing an infinite loop.
David's answer does contain most of this information and is what I meant "for the reasons David stated".
However, the main point here is that you do need an extra condition to prevent the infinite loop: either yours, which stops further rule processing on internal redirects, or mine, which prevents Rule 1 from matching.
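A third way to express the same guard, sketched here under the assumption that REDIRECT_STATUS is empty on the initial client request and only set once Apache has performed an internal redirect (I have not tested this against the setup in the question):

```apache
RewriteEngine On
RewriteBase /

# Rule 1: only redirect on the initial pass, i.e. before any internal redirect.
# The QUERY_STRING condition comes last so that %1 and %2 refer to its groups.
RewriteCond %{ENV:REDIRECT_STATUS} ^$
RewriteCond %{QUERY_STRING} ^id=(\d+)&cat=(\d+)$
RewriteRule ^index\.php$ /index/%1/%2/? [R,L]

# Rule 2: serve the friendly URL from the real script
RewriteRule ^index/(\d+)/(\d+)/$ /index.php?id=$1&cat=$2 [L]
```

Unlike the blanket "RewriteRule .* - [L]" guard in the question, this only shields Rule 1, so other rules in the same file still run on the second pass.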

mod_rewrite Redirect Rule Variables question

I'm a bit of an .htaccess n00b, and can't for the life of me get a handle of regular expressions.
I have the following piece of RewriteRule code that works just fine:
RewriteRule ^logo/?$ /pages/logo.html
Basically, it takes /pages/logo.html and makes it /logo.
Is there a way for me to generalize that code with variables, so that it works automatically without having to have an independent line for each page?
I know $1 can work as a variable, but that's usually for queries, and I can't get it to work in this instance.
First you need to know that mod_rewrite can only handle requests to the server. So you would need to request /logo to have it rewritten to /pages/logo.html. And that’s what the rule does, it rewrites requests with the URL path /logo internally to /pages/logo.html and not vice versa.
If you now want to use portions of the matched string, you need to group them with parentheses, (expr), and can then reference the groups with $n. In your case the pattern [^/]+ is suitable; [^/] describes any character other than the slash /:
RewriteRule ^([^/]+)$ /pages/$1.html
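A slightly more defensive sketch of the same idea, assuming the pages live in a pages/ directory directly under the document root. It only rewrites when the target file exists, and it tolerates an optional trailing slash like the original /logo rule does:

```apache
RewriteEngine On

# $1 is the group captured by the RewriteRule pattern below;
# -f checks that /pages/<name>.html is a regular file.
RewriteCond %{DOCUMENT_ROOT}/pages/$1.html -f
RewriteRule ^([^/]+)/?$ /pages/$1.html [L]
```

Because [^/]+ cannot contain a slash, the rewritten path pages/logo.html does not match the pattern again, so the rule cannot loop on its own output.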
Try this:
RewriteRule ^/pages/(.*)\.html$ /$1
The (.*) matches anything between pages/ and .html. Whatever it matches is used in $1. So, /pages/logo.html becomes /logo, and /pages/subdir/other_page.html would become /subdir/other_page

Redirecting non-www URL to www using .htaccess

I'm using Helicon's ISAPI Rewrite 3, which basically enables .htaccess in IIS. I need to redirect a non-www URL to the www version, i.e. example.com should redirect to www.example.com. I used the following rule from the examples but it affects subdomains:
RewriteCond %{HTTPS} (on)?
RewriteCond %{HTTP:Host} ^(?!www\.)(.+)$ [NC]
RewriteCond %{REQUEST_URI} (.+)
RewriteRule .? http(?%1s)://www.%2%3 [R=301,L]
This works for the most part, but it also redirects sub.example.com to www.sub.example.com. How can I rewrite the above rule so that subdomains do not get redirected?
Append the following RewriteCond:
RewriteCond %{HTTP:Host} ^[^.]+\.[a-z]{2,5}$ [NC]
That way the rule only applies to hosts of the form nondottedsomething.uptofiveletters. As you can see, subdomain.domain.com will not match the condition and thus will not be rewritten.
You can change [a-z]{2,5} for a stricter tld matching regex, as well as placing all the constraints for allowed chars in domain names (as [^.]+ is more permissive than strictly necessary).
All in all I think in this case that wouldn't be necessary.
EDIT: sadie spotted a flaw on the regex, changed the first part of it from [^.] to [^.]+
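Put together with the rule from the question, the result would look roughly like this. It is only an assembly of the lines already quoted above, with the new condition placed right after the existing Host condition; since it has no parentheses, %1, %2 and %3 keep their meaning.

```apache
RewriteCond %{HTTPS} (on)?
RewriteCond %{HTTP:Host} ^(?!www\.)(.+)$ [NC]
# Added: only bare domain.tld hosts, no further dots allowed
RewriteCond %{HTTP:Host} ^[^.]+\.[a-z]{2,5}$ [NC]
RewriteCond %{REQUEST_URI} (.+)
RewriteRule .? http(?%1s)://www.%2%3 [R=301,L]
```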
I've gotten more control using urlrewriter.net, something like:
<unless header="Host" match="^www\.">
<if url="^(https?://)[^/]*(.*)$">
<redirect to="$1www.domain.tld$2"/>
</if>
<redirect url="^(.*)$" to="http://www.domain.tld$1"/>
</unless>
Zigdon has the right idea except his regex isn't quite right. Use
^example\.com$
instead of his suggestion of:
^example\.com(.*)
Otherwise you won't just be matching example.com, you'll be matching things like example.comcast.net, example.com.au, etc.
@Vinko
For your generic approach, I'm not sure why you chose to limit the length of the TLD in your regex? It's not very future-proof, and I'm unsure what benefit it's providing? It's actually not even "now-proof" because there's at least one 6-character TLD out there (.museum) which won't be matched.
It seems unnecessary to me to do this. Couldn't you just do ^[^.]+\.[^.]+$? (note: the question-mark is part of the sentence, not the regex!)
All that aside, there is a bigger problem with this approach that is: it will fail for domains that aren't directly beneath the TLD. This is domains in Australia, UK, Japan, and many other countries, who have hierarchies: .co.jp, .co.uk, .com.au, and so on.
Whether or not that is of any concern to the OP, I don't know but it's something to be aware of if you're after a "fix all" answer.
The OP hasn't yet made it clear whether he wants a generic solution or a solution for a single (or small group) of known domains. If it's the latter, see my other note about using Zigdon's approach. If it's the former, then proceed with Vinko's approach taking into account the information in this post.
Edit: One thing I've left out until now, which may or may not be an option for you business-wise, is to go the other way. All our sites redirect http://www.domain.com to http://domain.com. The folks at http://no-www.org make a pretty good case (IMHO) for this being the "right" way to do it, but it's still certainly just a matter of preference. One thing is for sure though, it's far easier to write a generic rule for that kind of redirection than this one.
@org 0100h Yes, there are many variables left out of the description of the problem, and all your points are valid ones and should be addressed in the event of an actual implementation. There are both pros and cons to your proposed regex. On the one hand it's easier and future-proof; on the other, do you really want to match example.foobar if sent in the Host header? There might be some edge cases where you'll end up redirecting to the wrong domain. A third alternative is modifying the regex to use a list of the actual domains, if more than one, like
RewriteCond %{HTTP:Host} (example.com|example.net|example.org) [NC]
(Note to chris, that one will change %1)
@chrisofspades It's not meant to replace it, your condition number two ensures that it doesn't have www, whereas mine doesn't. It won't change the values of %1, %2, %3 because it doesn't store the matches (iow, it doesn't use parentheses).
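If the list-of-domains condition should neither renumber the back-references nor match hosts such as example.community.tld, a tighter sketch would be the following. It assumes the regex engine accepts non-capturing groups (?:...), which seems likely given that the original rule already uses a (?!...) lookahead; the three domain names are placeholders.

```apache
# Anchored, dots escaped, and grouped without capturing
RewriteCond %{HTTP:Host} ^(?:example\.com|example\.net|example\.org)$ [NC]
```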
Can't you adjust the RewriteCond to only operate on example.com?
RewriteCond %{HTTP:Host} ^example\.com(.*) [NC]
Why don't you just have something like this in your vhost (httpd) file?
ServerName www.example.com
ServerAlias example.com
Of course that won't redirect; both names will just be served as normal.
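If a redirect is wanted at the vhost level on Apache httpd, one possible sketch is to give the bare domain its own virtual host and use mod_alias there. The port and the DocumentRoot path below are made-up placeholders:

```apache
# Bare domain: does nothing except redirect to the www host
<VirtualHost *:80>
    ServerName example.com
    Redirect permanent / http://www.example.com/
</VirtualHost>

# Canonical host: serves the actual site
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example
</VirtualHost>
```

Redirect keeps the rest of the path, so http://example.com/foo would be sent to http://www.example.com/foo.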
