What does $1 in an .htaccess file mean?

I am trying to understand the meaning of this line in the .htaccess file
RewriteRule ([a-z0-9/-]+).html $1.php [NC,L,QSA]
Basically, what does $1.php refer to? Which file on the server?
If we have home.html, where is this going to redirect to? home.php?

$1 is the first captured group from your regular expression; that is, the contents between ( and ). If you had a second set of parentheses in your regex, $2 would contain the contents of those parens. Here is an example:
RewriteRule ([a-z0-9/-]+)-([a-z]+).html$ $1-$2.php [NC,L,QSA]
Say a user navigates to hello-there.html. They would be served hello-there.php. In your substitution string, $1 contains the contents of the first set of parens (hello), while $2 contains the contents of the second set (there). There will always be exactly as many "dollar" values available in your substitution string as there are sets of capturing parentheses in your regex.
If you have nested parens, say, (([a-z]+)-[a-z]+), $1 always refers to the outermost capture (in this case the whole regex), $2 is the first nested set, and so on.
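To see the numbering in action, here is a minimal sketch in Python rather than Apache. It assumes Python's re module treats these particular patterns the same way as the PCRE engine used by mod_rewrite, and apply_rule is a hypothetical helper written only for this illustration, not part of any Apache API.

```python
import re

def apply_rule(pattern, substitution, path, flags=0):
    """Return the rewritten path, or None when the pattern does not match."""
    m = re.search(pattern, path, flags)
    if m is None:
        return None
    # Replace each $N with the text of capture group N (empty if the group is unset).
    return re.sub(r"\$(\d)", lambda d: m.group(int(d.group(1))) or "", substitution)

# The two-group rule from the example above; NC is modelled with re.IGNORECASE.
result = apply_rule(r"([a-z0-9/-]+)-([a-z]+).html$", "$1-$2.php",
                    "hello-there.html", re.IGNORECASE)
print(result)  # hello-there.php

# Nested parentheses are numbered by the position of their opening "(".
nested = re.search(r"(([a-z]+)-[a-z]+)", "hello-there")
print(nested.group(1))  # hello-there
print(nested.group(2))  # hello
```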

.htaccess files can contain a wide variety of Apache configuration directives, but this one, like many, is to do with the URL rewriting module, mod_rewrite.
A RewriteRule directive has 3 parts:
a Pattern (regular expression) which needs to match against the current URL
a Substitution string, representing the URL to serve instead, or the URL to instruct the browser to redirect to
an optional set of flags
In this case, you have a regular expression which matches anything ending in .html which consists only of letters a-z, digits 0-9, / and -. However, it also contains a set of parentheses (...), which mark a part of the pattern to be "captured".
The Substitution string can then reference this "captured" value; the first capture is $1, and the second would be $2, and so on.
In this case, the captured part is everything before the .html, and the Substitution is $1.php, meaning whatever string came before .html is kept, but the .html is thrown away and .php is stuck on instead.
So for your specific example, accessing home.html will instead act as though you had requested home.php.
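If you want to try this outside Apache, the following rough Python sketch mimics the rule from the question. It assumes the path is matched without a leading slash (as in a per-directory .htaccess context) and models the NC flag with re.IGNORECASE; rewrite is a made-up helper name.

```python
import re

# The Pattern from the question; NC is modelled with re.IGNORECASE.
pattern = re.compile(r"([a-z0-9/-]+).html", re.IGNORECASE)

def rewrite(path):
    """Return the path the request would be mapped to, or the path unchanged."""
    m = pattern.search(path)
    # On a match, the Substitution "$1.php" replaces the whole path.
    return m.group(1) + ".php" if m else path

print(rewrite("home.html"))         # home.php
print(rewrite("blog/post-1.html"))  # blog/post-1.php
print(rewrite("style.css"))         # style.css (no match, left alone)
```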

It's a reference to the first capture group denoted by the parentheses in the pattern ([a-z0-9/-]+).html$. If there were two (.*)-(.*) then you would access $1 for the first capture group and $2 for the second, etc...

$1 refers to the first group captured by your regex (i.e. the part between parentheses). In your case it refers to:
([a-z0-9/-]+)
For the URL mypage.html, $1 will contain "mypage", and the rule will rewrite the request to mypage.php.

Related

Is it possible to put a line break into a 'mailto:' rewrite in htaccess?

For complex reasons I've had to remove an enquiry form from a web site and use a 'mailto:' instead. For simplicity I've changed the htaccess file so that the former 'contact' link to the form now becomes a 'mailto:' as follows:
RewriteRule ^contact$ mailto:myname#mydomain.com?subject=BusinessName\ BandB\ Enquiry&body=You\ can\ find\ our\ availability\ on\ line.\ Delete\ this\ content\ if\ inapplicable
That does work, my local e-mail client (Thunderbird) opens with the information correctly shown in subject and body. (My TB is set to compose in plain text, I've yet to test with HTML)
I would like to introduce a new line in the body so that 'Delete this content if inapplicable' is on a separate line. Is there any way to do this? Given mod_rewrite's intended purpose I could understand if there isn't but I thought I'd ask before giving up.
I would like to introduce a new line in the body so that 'Delete this content if inapplicable' is on a separate line.
New lines in the body are represented by two characters: carriage return (char 13) + line feed (char 10) (see RFC2368). This would need to be URL encoded in the resulting URL as %0D%0A.
When used in the RewriteRule substitution string, the literal % characters need to be backslash-escaped to negate their special meaning as a backreference to the preceding CondPattern (of which there is none here), i.e. \%0D\%0A. Otherwise, you will end up with the string DA, because there is no %0 backreference in this example.
You can also avoid having to backslash-escape all the literal spaces by enclosing the entire argument (substitution string) in double quotes.
So, try the following instead:
RewriteRule ^contact$ "mailto:myname#mydomain.com?subject=BusinessName BandB Enquiry&body=You can find our availability on line.\%0D\%0ADelete this content if inapplicable" [R,L]
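To double-check the encoding itself, here is a small Python sketch using the standard urllib.parse.quote function. The address myname@example.com is a placeholder I made up for the illustration, and the snippet only shows how the mailto URL is percent-encoded, not how mod_rewrite processes it.

```python
from urllib.parse import quote

subject = "BusinessName BandB Enquiry"
body = "You can find our availability on line.\r\nDelete this content if inapplicable"

# quote() percent-encodes spaces as %20 and the CR LF pair as %0D%0A.
mailto = ("mailto:myname@example.com?subject=" + quote(subject)
          + "&body=" + quote(body))

print(quote("\r\n"))  # %0D%0A
print(mailto)
```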

RewriteRule cuts off part of a variable name

I have a RewriteRule inside my .htaccess file:
RewriteRule ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ /incl/pages/seo.club.php?state=$1&county=$2&title=$3 [NC,L]
For most cases it works fine, however, if the title starts with the word "club" that word is cut off.
For example, if the name is fast-cars-club the $_GET['title'] will be unchanged, as desired, however if the slug is club-of-fast-cars the $_GET['title'] will output -of-fast-cars
In the following URL:
mysite.com/tx/travis/club/fast-cars-club
$_GET['title'] == 'fast-cars-club'
But in this URL:
mysite.com/tx/travis/club/club-fast-cars
$_GET['title'] == '-fast-cars'
What am I missing?
Your rule is too broad, so it can match strings in multiple different ways. The way that you were hoping it would match isn't necessarily the one that the regular expression engine will actually process.
First, let's break down your pattern ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ into the parts the engine will process:
^ start of string
[a-zA-Z-/] a lower-case letter, an upper-case letter, a hyphen - or a slash /
([a-zA-Z-/]{2}) the above must match exactly 2 characters, which will be captured as $1
/ a literal slash, not optional, not captured
([a-zA-Z-/]+) the same set of characters as earlier; this time required to match one or more times (+); captured as $2
/club the literal string /club, not optional, not captured
/? a literal slash, optional (specifically, ? means must occur zero or one times)
[a-zA-Z0-9-] a lower-case letter, an upper-case letter, a digit, or a hyphen -
([a-zA-Z0-9-]+) the above must match one or more times; captured as $3
([a-zA-Z0-9-]+)? the above capture group as a whole is optional
/? a literal slash, optional
$ end of string
Next, look at how this matches a URL, starting with the one which works how you hoped (tx/travis/club/fast-cars-club, since the mysite.com/ is processed separately):
the ^ indicates that we can't throw anything away at the start of the string
tx matches ([a-zA-Z-/]{2}) and goes into $1
/ matches
([a-zA-Z-/]+) could match the whole of travis/club/fast-cars-club but this leaves nothing for the rest of the pattern to match.
The regex engine now applies "back-tracking": it tries shorter matches until it finds one that matches more of the pattern. In this case, it finds that if it takes just travis and puts it in $2, it can match the mandatory /club which comes next
/club is followed by /, so /? matches
fast-cars-club matches [a-zA-Z0-9-]+, so is captured into $3
we've used the whole input string, so $ succeeds
Now look at the "misbehaving" string, tx/travis/club/club-fast-cars:
the ^ indicates that we can't throw anything away at the start of the string
tx matches ([a-zA-Z-/]{2}) and goes into $1
/ matches
([a-zA-Z-/]+) could match the whole of travis/club/club-fast-cars but this leaves nothing for the rest of the pattern to match.
While "back-tracking", the regex engine tries putting travis/club into $2; this is followed by another /club, so the match succeeds
there is no following /, but that's fine: /? can match zero occurrences
the remainder of the string, -fast-cars matches [a-zA-Z0-9-]+, so is captured into $3
we've used the whole input string, so $ succeeds
This behaviour of "greediness" and "back-tracking" is key to understanding complex regular expressions, but most of the time the solution is simply to make the regular expression less complex, and more specific.
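The two walkthroughs above can be reproduced outside Apache. This is only a sketch: it assumes Python's re module backtracks the same way as mod_rewrite's PCRE engine for this pattern, and it models the NC flag with re.IGNORECASE.

```python
import re

# The Pattern from the question, unchanged.
original = re.compile(
    r"^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$",
    re.IGNORECASE,
)

good = original.match("tx/travis/club/fast-cars-club").groups()
bad = original.match("tx/travis/club/club-fast-cars").groups()

print(good)  # ('tx', 'travis', 'fast-cars-club')
print(bad)   # ('tx', 'travis/club', '-fast-cars')
```

In the second case the greedy $2 gives back as little as it has to, so it keeps travis/club and the word club from the title is consumed by the literal /club in the pattern.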
Only you know the full rules you want to specify, but as a starting point, let's make everything mandatory:
exactly two letters (the state) [a-zA-Z]{2}
/
one or more letters or hyphens (the county) [a-zA-Z-]+
/
the literal word club
/
one or more letters or hyphens (the title) [a-zA-Z-]+
/
Adding parentheses to capture the three parts gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club/([a-zA-Z-]+)/$
Now we can decide to make some parts optional, remembering that the more we make optional, the more ways there might be to re-interpret a URL.
We can probably safely make the trailing / optional. Alternatively, we can have a separate rule that matches any URL without a trailing / and redirects to a URL with it added on (this is quite common to allow both forms but reduce the number of duplicate URLs in search engines).
If we wanted to allow mysite.com/tx/travis/ in addition to mysite.com/tx/travis/club/club-fast-cars/ we could make the whole /club/([a-zA-Z-]+) section optional: ^([a-zA-Z]{2})/([a-zA-Z-]+)(/club/([a-zA-Z-]+))?/$ Note that the extra parentheses capture an extra variable, so what was $3 will now be $4.
Or maybe we want to allow mysite.com/tx/travis/club/, in which case we would make /([a-zA-Z-]+) optional - note that we want to include the / in the optional part, even though we don't want to capture it. That gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club(/([a-zA-Z-]+))?/$
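As a quick check of the stricter patterns, here is a Python sketch run against the example URLs. Again this assumes Python's re behaves like the engine behind mod_rewrite for these patterns, and the variable names are just for the illustration.

```python
import re

# Everything mandatory, including the trailing slash.
strict = re.compile(r"^([a-zA-Z]{2})/([a-zA-Z-]+)/club/([a-zA-Z-]+)/$")
# The /title part made optional; note the extra capture group.
optional_title = re.compile(r"^([a-zA-Z]{2})/([a-zA-Z-]+)/club(/([a-zA-Z-]+))?/$")

with_title = "tx/travis/club/club-fast-cars/"

print(strict.match(with_title).groups())
# ('tx', 'travis', 'club-fast-cars')
print(optional_title.match(with_title).groups())
# ('tx', 'travis', '/club-fast-cars', 'club-fast-cars')
print(optional_title.match("tx/travis/club/").groups())
# ('tx', 'travis', None, None)
```

The title now lands in the fourth group of the optional variant, which is why the Substitution would have to use $4 instead of $3.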
The two things we almost certainly don't want, both of which you had, are:
Allowing / inside any of the character ranges; keep it for separating components only unless you have a really good reason to allow it elsewhere.
Making / optional in the middle; as we saw, this just leads to multiple ways of matching the same string, and makes the whole thing more complicated than it needs to be.

Removal of trailing dot in RewriteRule of .htaccess

The .htaccess rewrite rule applied in a restful database application:
RewriteRule ^author/([A-z.]+)/([A-z]+)$ get_author.php?first_name=$1&last_name=$2
applied to
http://localhost:8080/API/author/J./Doe
removes the period from "J." and the resulting name "J Doe" is obviously not in the database (while "J. Doe" is). This rewrite rule only removes a trailing period, e.g. "J.O" translates correctly to "J.O". I use XAMPP 7.0.6 plus Apache under Windows 10. What to do in order to NOT remove the trailing dot on the initial?
Update:
Apparently my question wasn't clear, I give it another try.
The regexp (RewriteRule) above is supposed to assign "J." to the variable $1. Instead it assigns "J" to $1, in other words, the regex drops the trailing dot. Secondly, the regex assigns "Doe" to the variable $2, this assignment is as expected and correct. The variables $1 (with incorrect value "J") and $2 (with correct value "Doe") are used in a database search. This search fails because of the missing dot. The database contains "J. Doe", but not "J Doe".
When a dot is not trailing, as in "J.O", the variable $1 gets the correct value "J.O". In other words, the regex does not remove all dots, only the trailing ones.
My question is: how can I tell (the rewrite engine of) .htaccess to apply the regexp correctly?
For comparison, the following piece of JS code does what I want:
var regexp = "^author/([A-z.]+)/([A-z]+)$";
var result = "author/J./Doe".match(regexp);
alert(result[1] + " " + result[2]);
This is apparently (still) a "feature": https://bz.apache.org/bugzilla/show_bug.cgi?id=20036
Problem: Apache strips all trailing dots and spaces unless the path segment is exactly "." or "..".
I ran into the problem because I tried to map a URL from get/a/b/c to get.php?param1=a&param2=b&param3=c, but c can legitimately have trailing dots. The issue is not actually mod_rewrite related but happens with regular URLs too. Example URL of a file that's definitely not named this way: Example favicon file. Other servers don't do this. Example: Stackoverflow favicon file. This turns the behaviour into a way to detect an Apache server when the HTTP server header is stripped.
To work around this problem, I still map the URL using mod_rewrite, but then in the PHP script, I use the exact same regex to manually map the parameters:
if (preg_match('#/get/([^/]+)/([^/]+)/(.+)$#', $_SERVER['REQUEST_URI'], $matches)) {
    $param1 = $matches[1];
    $param2 = $matches[2];
    $param3 = $matches[3];
}
Instead of using the PATH_INFO, I use the REQUEST_URI because it's untouched.
This means if you absolutely need to pass trailing dots in a path string to a backend using apache, your best bet right now is to write an intermediate script that extracts the proper parameters and then does the proxy request for you.
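For anyone whose backend is not PHP, the same idea can be sketched in Python. This assumes your framework hands you the raw, untouched request URI as a string; extract_params is a hypothetical helper, and the sketch does not strip a query string, which a raw request URI may also contain.

```python
import re

def extract_params(request_uri):
    """Pull the three path parameters out of the raw request URI."""
    # Same regex as the PHP snippet above, applied to the unmodified URI.
    m = re.search(r"/get/([^/]+)/([^/]+)/(.+)$", request_uri)
    return m.groups() if m else None

print(extract_params("/get/a/b/c..."))  # ('a', 'b', 'c...')
print(extract_params("/other/path"))    # None
```

The regex itself keeps the trailing dots, which is consistent with the point above that the stripping happens in Apache and not in the pattern matching.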

mod_rewrite RewriteRule backreference use in pattern

Is it possible to use a backreference in the middle of the RewriteRule pattern?
I envision this as such:
RewriteRule ^(.)([a-z]*)$1([a-z]*) $2-$3
                        ^
(note, this fellow, here, refers to that first "(.)")
If the received url is XboopXdoop, the result would of course be boop-doop.
I am attempting to use this to specify a delimiter at the beginning of the incoming url that can be used to parse the rest of the string, without forcing the use of a specific character as that delimiter.
Thank you.
$1 works on the right side (rewrite), but not in the regex. You need to use \1.
Try:
RewriteRule ^(.)([a-z]+)\1([a-z]+) $2-$3
I ran into a bizarre edge case with the * where it split based on the second character of the string, and not the first. XtestingXtest resulted in es-ing ... so yeah, not sure what was happening there. If I use a + it works fine.
Also, since * and + are greedy, if you have multiple delimiter characters, it will split on the last occurrence of the character:
XbaseXtest -> base-test
XbaseXteXst -> baseXte-st
XbaseXtestX -> baseXtest-
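The three results above can be reproduced with a short Python sketch. Two assumptions are baked in: matching is case-insensitive (otherwise [a-z] could not run across the upper-case X), and the * quantifier from the question is used, since the third result needs $3 to be allowed to be empty. split_on_delimiter is a made-up helper name.

```python
import re

# Backreference \1 inside the pattern refers to the delimiter captured by (.)
pattern = re.compile(r"^(.)([a-z]*)\1([a-z]*)", re.IGNORECASE)

def split_on_delimiter(url):
    """Return "$2-$3" for the given URL, or None if the pattern does not match."""
    m = pattern.match(url)
    return f"{m.group(2)}-{m.group(3)}" if m else None

print(split_on_delimiter("XbaseXtest"))   # base-test
print(split_on_delimiter("XbaseXteXst"))  # baseXte-st
print(split_on_delimiter("XbaseXtestX"))  # baseXtest-
```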

Two RewriteRules interfering, only one works at a time

In an .htaccess file with two RewriteRules, each works alone, but not both together:
RewriteRule ^([1-9]+)/.*/(.*) /sortir/index.php?com=page1&t=$1&l=$2 [QSA]
RewriteRule ^([1-9]+)/([1-9]+)/.* /sortir/index.php?com=page2&t=$1&v=$2 [QSA]
If I delete first, second works.
If I delete second, first works.
The link called for the first is like :
http://example.com/33/xxxx/city so $1 is 33 and $2 is city
The link called for the second is like :
http://example.com/33/432/xxxx/city/yyyyy so $1 is 33 and $2 is 432
Although, as anubhava notes, more details would be helpful, there are at least a few issues I can point out with your current rules.
First, reverse the order of the rules. The second rule is less general since it starts with two sections of numbers and additional sections of text. Match that first, then match the more general rule.
Second, end each rule with the L flag, otherwise processing will continue to the second rule after the first is finished.
Third, update your matches so that they don't match a slash. This forces the pattern to match the exact directory structure you're looking for rather than matching any arbitrary number of directory levels.
With those things in mind, here are some updated rules to play with:
First rule matches http://example.com/33/432/xxxx/city/yyyy
Second rule matches http://example.com/33/xxxx/city
RewriteRule ^([1-9]+)/([1-9]+)/[^/]+/[^/]+/.* /sortir/index.php?com=page2&t=$1&v=$2 [QSA,L]
RewriteRule ^([1-9]+)/[^/]*/([^/]*)$ /sortir/index.php?com=page1&t=$1&l=$2 [QSA,L]
If this is not the exact rule set you need, it should at least get you closer.
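If you want to sanity-check the ordering before touching the server, here is a rough Python sketch of "first matching rule wins". It assumes the path is matched without a leading slash, that Python's re agrees with mod_rewrite on these patterns, and that stopping at the first match is a fair stand-in for the L flag; route and RULES are names invented for the illustration.

```python
import re

# The two updated rules, most specific first.
RULES = [
    (re.compile(r"^([1-9]+)/([1-9]+)/[^/]+/[^/]+/.*"),
     "/sortir/index.php?com=page2&t=$1&v=$2"),
    (re.compile(r"^([1-9]+)/[^/]*/([^/]*)$"),
     "/sortir/index.php?com=page1&t=$1&l=$2"),
]

def route(path):
    """Return the target of the first rule whose pattern matches, else None."""
    for pattern, substitution in RULES:
        m = pattern.search(path)
        if m:
            # Fill in $1, $2 and stop at this rule.
            return re.sub(r"\$(\d)", lambda d: m.group(int(d.group(1))), substitution)
    return None

print(route("33/432/xxxx/city/yyyy"))  # /sortir/index.php?com=page2&t=33&v=432
print(route("33/xxxx/city"))           # /sortir/index.php?com=page1&t=33&l=city

# For comparison, the original first pattern also matches the longer URL.
print(bool(re.search(r"^([1-9]+)/.*/(.*)", "33/432/xxxx/city/yyyy")))  # True
```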
Both rules overlap, since this regex:
^([1-9]+)/.*/(.*)
will also match any URL matched by this one:
^([1-9]+)/([1-9]+)/.*
hence only one will fire. Why don't you explain your requirements clearly? Then we can help you write the RewriteRules in an unambiguous manner.
