mod_rewrite RewriteRule backreference use in pattern - .htaccess

Is it possible to use a backreference in the middle of the RewriteRule pattern?
I envision this as such:
RewriteRule ^(.)([a-z]*)$1([a-z]*) $2-$3
                        ^
(note, this fellow, here, refers to that first "(.)")
If the received url is XboopXdoop, the result would of course be boop-doop.
I am attempting to use this to specify a delimiter at the beginning of the incoming url that can be used to parse the rest of the string, without forcing the use of a specific character as that delimiter.
Thank you.

$1 works on the right side (rewrite), but not in the regex. You need to use \1.
Try:
RewriteRule ^(.)([a-z]+)\1([a-z]+) $2-$3
I ran into a bizarre edge case with the * where it split based on the second character of the string, and not the first. XtestingXtest resulted in es-ing ... so yeah, not sure what was happening there. If I use a + it works fine.
Also, since * and + are greedy, if you have multiple delimiter characters, it will split on the last occurrence of the character:
XbaseXtest -> base-test
XbaseXteXst -> baseXte-st
XbaseXtestX -> baseXtest-
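To see the backreference and the greedy split outside of Apache, here is a minimal Python sketch. It assumes Python's re module behaves like Apache's PCRE for this pattern, uses the * variant, and adds case-insensitive matching. That last part is an assumption: without it, [a-z] could never run across an uppercase X, so the splits listed above would not happen. The helper name is made up for illustration.

```python
import re

# Same shape as the rule above, but with * and case-insensitive matching (assumption).
PATTERN = re.compile(r'^(.)([a-z]*)\1([a-z]*)', re.IGNORECASE)

def split_on_leading_delimiter(path):
    """Hypothetical helper: mimic the '$2-$3' substitution, or return None if no match."""
    match = PATTERN.match(path)
    if match is None:
        return None
    # \2 and \3 in Python's template play the role of $2 and $3 in the RewriteRule.
    return match.expand(r'\2-\3')

for url in ("XbaseXtest", "XbaseXteXst", "XbaseXtestX"):
    print(url, "->", split_on_leading_delimiter(url))
```

The greedy second group grabs as much as it can and only gives characters back until \1 can match, which is why the last delimiter wins.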

Related

Is it possible to put a line break into a 'mailto:' rewrite in htaccess?

For complex reasons I've had to remove an enquiry form from a web site and use a 'mailto:' instead. For simplicity I've changed the htaccess file so that the former 'contact' link to the form now becomes a 'mailto:' as follows:
RewriteRule ^contact$ mailto:myname#mydomain.com?subject=BusinessName\ BandB\ Enquiry&body=You\ can\ find\ our\ availability\ on\ line.\ Delete\ this\ content\ if\ inapplicable
That does work: my local e-mail client (Thunderbird) opens with the information correctly shown in subject and body. (My TB is set to compose in plain text; I've yet to test with HTML.)
I would like to introduce a new line in the body so that 'Delete this content if inapplicable' is on a separate line. Is there any way to do this? Given mod_rewrite's intended purpose I could understand if there isn't but I thought I'd ask before giving up.
I would like to introduce a new line in the body so that 'Delete this content if inapplicable' is on a separate line.
New lines in the body are represented by two characters: carriage return (char 13) + line feed (char 10) (see RFC2368). This would need to be URL encoded in the resulting URL as %0D%0A.
When used in the RewriteRule substitution string, the literal % characters need to be backslash-escaped to negate their special meaning as a backreference to the preceding CondPattern (there isn't one here), i.e. \%0D\%0A. Otherwise, you will end up with the string DA, because there is no %0 backreference in this example.
You can also avoid having to backslash-escape all the literal spaces by enclosing the entire argument (substitution string) in double quotes.
So, try the following instead:
RewriteRule ^contact$ "mailto:myname#mydomain.com?subject=BusinessName BandB Enquiry&body=You can find our availability on line.\%0D\%0ADelete this content if inapplicable" [R,L]
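If you want to double-check the encoding, here is a small Python sketch (not part of the Apache config) that percent-encodes the body text with the standard library and then backslash-escapes the % signs of the line break in the form the substitution string needs. The variable names are just for illustration.

```python
from urllib.parse import quote

# CR + LF is the line break a mailto body expects.
line_break = "\r\n"
encoded_break = quote(line_break, safe="")
print(encoded_break)

# Inside a RewriteRule substitution, each literal % must be backslash-escaped.
escaped_for_rewrite = encoded_break.replace("%", "\\%")
print(escaped_for_rewrite)

# The whole body, fully percent-encoded (spaces become %20).
body = "You can find our availability on line." + line_break + "Delete this content if inapplicable"
encoded_body = quote(body, safe="")
print(encoded_body)
```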

RewriteRule cuts off part of a variable name

I have a RewriteRule inside my .htaccess file:
RewriteRule ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ /incl/pages/seo.club.php?state=$1&county=$2&title=$3 [NC,L]
For most cases it works fine, however, if the title starts with the word "club" that word is cut off.
For example, if the name is fast-cars-club the $_GET['title'] will be unchanged, as desired, however if the slug is club-of-fast-cars the $_GET['title'] will output -of-fast-cars
In the following URL:
mysite.com/tx/travis/club/fast-cars-club
$_GET['title'] == 'fast-cars-club'
But in this URL:
mysite.com/tx/travis/club/club-fast-cars
$_GET['title'] == '-fast-cars'
What am I missing?
Your rule is too broad, so it can match strings in multiple different ways. The way that you were hoping it would match isn't necessarily the one that the regular expression engine will actually process.
First, let's break down your pattern ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ into the parts the engine will process:
^ start of string
[a-zA-Z-/] a lower-case letter, an upper-case letter, a hyphen - or a slash /
([a-zA-Z-/]{2}) the above must match exactly 2 characters, which will be captured as $1
/ a literal slash, not optional, not captured
([a-zA-Z-/]+) the same set of characters as earlier; this time required to match one or more times (+); captured as $2
/club the literal string /club, not optional, not captured
/? a literal slash, optional (specifically, ? means must occur zero or one times)
[a-zA-Z0-9-] a lower-case letter, an upper-case letter, a digit, or a hyphen -
([a-zA-Z0-9-]+) the above must match one or more times; captured as $3
([a-zA-Z0-9-]+)? the above capture group as a whole is optional
/? a literal slash, optional
$ end of string
Next, look at how this matches a URL, starting with the one which works how you hoped (tx/travis/club/fast-cars-club, since the mysite.com/ is processed separately):
the ^ indicates that we can't throw anything away at the start of the string
tx matches ([a-zA-Z-/]{2}) and goes into $1
/ matches
([a-zA-Z-/]+) could match the whole of travis/club/fast-cars-club but this leaves nothing for the rest of the pattern to match.
The regex engine now applies "back-tracking": it tries shorter matches until it finds one that matches more of the pattern. In this case, it finds that if it takes just travis and puts it in $2, it can match the mandatory /club which comes next
/club is followed by /, so /? matches
fast-cars-club matches [a-zA-Z0-9-]+, so is captured into $3
we've used the whole input string, so $ succeeds
Now look at the "misbehaving" string, tx/travis/club/club-fast-cars:
the ^ indicates that we can't throw anything away at the start of the string
tx matches ([a-zA-Z-/]{2}) and goes into $1
/ matches
([a-zA-Z-/]+) could match the whole of travis/club/club-fast-cars but this leaves nothing for the rest of the pattern to match.
While "back-tracking", the regex engine tries putting travis/club into $2; this is followed by another /club, so the match succeeds
there is no following /, but that's fine: /? can match zero occurrences
the remainder of the string, -fast-cars matches [a-zA-Z0-9-]+, so is captured into $3
we've used the whole input string, so $ succeeds
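The two walkthroughs above can be reproduced with a short Python sketch. This assumes Python's re module back-tracks the same way as Apache's PCRE for this pattern (it does for simple cases like this, as far as I know); the helper name is hypothetical.

```python
import re

# The original, overly broad pattern; IGNORECASE stands in for the [NC] flag.
BROAD = re.compile(
    r'^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$',
    re.IGNORECASE,
)

def captures(path):
    """Hypothetical helper: return ($1, $2, $3) for a path, or None if no match."""
    match = BROAD.match(path)
    return match.groups() if match else None

print(captures("tx/travis/club/fast-cars-club"))
print(captures("tx/travis/club/club-fast-cars"))
```

In the second call the county group swallows travis/club, because the character class allows / and the greedy + prefers the longest match that still lets the rest of the pattern succeed.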
This behaviour of "greediness" and "back-tracking" is key to understanding complex regular expressions, but most of the time the solution is simply to make the regular expression less complex, and more specific.
Only you know the full rules you want to specify, but as a starting point, let's make everything mandatory:
exactly two letters (the state) [a-zA-Z]{2}
/
one or more letters or hyphens (the county) [a-zA-Z-]+
/
the literal word club
/
one or more letters or hyphens (the title) [a-zA-Z-]+
/
Adding parentheses to capture the three parts gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club/([a-zA-Z-]+)/$
Now we can decide to make some parts optional, remembering that the more we make optional, the more ways there might be to re-interpret a URL.
We can probably safely make the trailing / optional. Alternatively, we can have a separate rule that matches any URL without a trailing / and redirects to a URL with it added on (this is quite common to allow both forms but reduce the number of duplicate URLs in search engines).
If we wanted to allow mysite.com/tx/travis/ in addition to mysite.com/tx/travis/club/club-fast-cars/ we could make the whole /club/([a-zA-Z-]+) section optional: ^([a-zA-Z]{2})/([a-zA-Z-]+)(/club/([a-zA-Z-]+))?/$ Note that the extra parentheses capture an extra variable, so what was $3 will now be $4.
Or maybe we want to allow mysite.com/tx/travis/club/, in which case we would make /([a-zA-Z-]+) optional - note that we want to include the / in the optional part, even though we don't want to capture it. That gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club(/([a-zA-Z-]+))?/$
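As a rough check of the stricter patterns, here is a Python sketch (again assuming Python's re is a fair stand-in for PCRE here). It shows that the mandatory version captures the title intact, and that wrapping the /club/... section in an extra pair of parentheses shifts the title from group 3 to group 4.

```python
import re

# Everything mandatory, including the trailing slash.
STRICT = re.compile(r'^([a-zA-Z]{2})/([a-zA-Z-]+)/club/([a-zA-Z-]+)/$')

# The whole /club/title section made optional; note the extra capture group.
OPTIONAL_CLUB = re.compile(r'^([a-zA-Z]{2})/([a-zA-Z-]+)(/club/([a-zA-Z-]+))?/$')

strict_groups = STRICT.match("tx/travis/club/club-fast-cars/").groups()
print(strict_groups)

with_club = OPTIONAL_CLUB.match("tx/travis/club/club-fast-cars/").groups()
without_club = OPTIONAL_CLUB.match("tx/travis/").groups()
print(with_club)
print(without_club)  # unmatched optional groups come back as None
```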
The two things we almost certainly don't want, which you had, are:
Allowing / inside any of the character ranges; keep it for separating components only unless you have a really good reason to allow it elsewhere.
Making / optional in the middle; as we saw, this just leads to multiple ways of matching the same string, and makes the whole thing more complicated than it needs to be.

htaccess RewriteRule - pattern not matching

I'm trying to use categories in a clean way in my urls like this:
website.com/category
In the URL the categories are written like this (some random examples):
Animals
Consumer-Electronics
Books-&-Comics
External-Hard-Discs
Form,-Beauty-&-Health
Black-&-White-TV
The-Adventures-Of-Tintin
Fryers,-Waffle-makers-&-Cooking
etc...
As you can see, there is a random combination of words (with starting upper case), characters "-", ",", and "&". There are more combinations than the examples.
With rewrite I'm trying to get the categories in a variable like this:
RewriteRule ^([\w-&]+)$ /categories.php?mcn=$1 [L,NC]
This is not working. If I read out the variable I wanted with "Books-&-Comics" in categories.php, I only get "Books-" while it should be "Books-&-Comics".
When I add a "," in the character class like this:
RewriteRule ^([\w,-&]+)$ /categories.php?mcn=$1 [L,NC]
I get an internal server error.
What should my RewriteRule look like to match the category examples and get them correctly in the variable?
For your first problem, the issue is that your parameters are being decoded and thus the & is starting a new URL parameter. You can fix this by adding a B flag to your rule.
Your second issue is that the pattern ^([\w,-&]+)$ is invalid. It tries to match any word character, or any character in the range from , to & (ASCII 44 to 38). Because that range is out of order, the regex fails to compile. Since you want to match the - character literally rather than use it as a range indicator, it should be escaped.
With these changes made your rule is:
RewriteRule ^([\w,\-&]+)$ /categories.php?mcn=$1 [L,NC,B]
A regex helper like regex101 can be a huge help in creating your rules.
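Both problems can be illustrated in a few lines of Python. This is only a sketch: it assumes Python's re rejects the reversed range the same way PCRE does, and it uses urllib's quote as a rough stand-in for what the B flag does to the backreference (the B flag itself is Apache-only).

```python
import re
from urllib.parse import parse_qs, quote

# Problem 2: the range ,-& runs from ASCII 44 down to 38, so it does not compile.
try:
    re.compile(r'^([\w,-&]+)$')
    range_error = None
except re.error as exc:
    range_error = str(exc)
print(range_error)

# Escaping the hyphen makes it a literal character.
FIXED = re.compile(r'^([\w,\-&]+)$')
print(bool(FIXED.match("Fryers,-Waffle-makers-&-Cooking")))

# Problem 1: an unescaped & in the substituted value starts a new query parameter.
category = "Books-&-Comics"
raw_query = parse_qs("mcn=" + category)
escaped_query = parse_qs("mcn=" + quote(category, safe=""))
print(raw_query)
print(escaped_query)
```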

what does $1 in .htaccess file mean?

I am trying to understand the meaning of this line in the .htaccess file
RewriteRule ([a-z0-9/-]+).html $1.php [NC,L,QSA]
Basically, what does $1.php mean? Which file on the server does it refer to?
If we have home.html, where is this going to redirect to? home.php?
$1 is the first captured group from your regular expression; that is, the contents between ( and ). If you had a second set of parentheses in your regex, $2 would contain the contents of those parens. Here is an example:
RewriteRule ([a-z0-9/-]+)-([a-z]+).html$ $1-$2.php [NC,L,QSA]
Say a user navigates to hello-there.html. They would be served hello-there.php. In your substitution string, $1 contains the contents of the first set of parens (hello), while $2 contains the contents of the second set (there). There will always be exactly as many "dollar" values available in your substitution string as there are sets of capturing parentheses in your regex.
If you have nested parens, say, (([a-z]+)-[a-z]+), $1 always refers to the outermost capture (in this case the whole regex), $2 is the first nested set, and so on.
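Group numbering works the same way in most regex engines, so you can try it quickly in Python (a sketch only; Python writes the references as group(1), group(2) rather than $1, $2):

```python
import re

# Two sibling groups, as in the example rule above.
two_groups = re.search(r'([a-z0-9/-]+)-([a-z]+).html$', 'hello-there.html', re.IGNORECASE)
print(two_groups.group(1), two_groups.group(2))

# Nested groups: numbering follows the position of each opening parenthesis,
# so the outer group is 1 and the inner group is 2.
nested = re.search(r'(([a-z]+)-[a-z]+)', 'hello-there')
print(nested.group(1), nested.group(2))
```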
.htaccess files can contain a wide variety of Apache configuration directives, but this one, like many, is to do with the URL rewriting module, mod_rewrite.
A RewriteRule directive has 3 parts:
a Pattern (regular expression) which needs to match against the current URL
a Substitution string, representing the URL to serve instead, or to instruct the browser to redirect to
an optional set of flags
In this case, you have a regular expression which matches anything ending in .html which consists only of letters a-z, digits 0-9, / and -. However, it also contains a set of parentheses (...), which mark a part of the pattern to be "captured".
The Substitution string can then reference this "captured" value; the first capture is $1, and the second would be $2, and so on.
In this case, the captured part is everything before the .html, and the Substitution is $1.php, meaning whatever string came before .html is kept, but the .html is thrown away and .php is stuck on instead.
So for your specific example, accessing home.html will instead act as though you had requested home.php.
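Here is a minimal Python sketch of that capture-and-substitute step. It only imitates the Pattern and Substitution parts of the rule (with IGNORECASE standing in for the NC flag); the rewrite function is hypothetical and ignores everything else mod_rewrite does, such as the L and QSA flags.

```python
import re

# Pattern from the rule in the question.
PATTERN = re.compile(r'([a-z0-9/-]+).html', re.IGNORECASE)

def rewrite(path):
    """Hypothetical helper: return the substituted path, or the path unchanged if no match."""
    match = PATTERN.search(path)
    if match is None:
        return path
    # \1 here plays the role of $1 in the Substitution string.
    return match.expand(r'\1.php')

for path in ("home.html", "blog/post-1.HTML", "style.css"):
    print(path, "->", rewrite(path))
```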
It's a reference to the first capture group denoted by the parentheses in the pattern ([a-z0-9/-]+).html$. If there were two (.*)-(.*) then you would access $1 for the first capture group and $2 for the second, etc...
$1 refers to the first group captured by your regex (i.e. the part between the parentheses). In your case it refers to:
([a-z0-9/-]+)
For the URL mypage.html, $1 will contain "mypage", and the rule will rewrite the request to mypage.php.

.htaccess: how to make a digit parameter optional?

In my example, this URL works with these parameters:
www.website.com/search/dogs-10-cats-5
what I want is to have one of the parameters (a digit or empty) to be optional like:
www.website.com/search/dogs--cats-3
Is this possible? Here is my current rewrite rule:
RewriteRule ^search/dogs-?([0-9]+)-cats-?([0-9]+)/? index.php?SearchResults&Dogs=$1&Cats=$2 [L]
dogs-?([0-9]*)-cats-?([0-9]+)
instead of
dogs-?([0-9]+)-cats-?([0-9]+)
+ means "one or more", * means "zero or more"
I would also recommend omitting the ?s, because they make every second dash optional (this may be intended, but I guess not), so a URL like this would also match:
dogs10-cats5
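A quick way to compare the variants is to run them in Python (a sketch, assuming Python's re matches these simple patterns the same way Apache does; the helper and constant names are made up):

```python
import re

ORIGINAL = re.compile(r'^search/dogs-?([0-9]+)-cats-?([0-9]+)/?')
OPTIONAL_DOGS = re.compile(r'^search/dogs-?([0-9]*)-cats-?([0-9]+)/?')
# Same as above, but with the ?s after the dashes removed.
STRICT_DASHES = re.compile(r'^search/dogs-([0-9]*)-cats-([0-9]+)/?')

def captures(pattern, path):
    """Hypothetical helper: return ($1, $2) for a path, or None if no match."""
    match = pattern.match(path)
    return match.groups() if match else None

print(captures(ORIGINAL, "search/dogs--cats-3"))
print(captures(OPTIONAL_DOGS, "search/dogs--cats-3"))
print(captures(OPTIONAL_DOGS, "search/dogs10-cats5"))
print(captures(STRICT_DASHES, "search/dogs10-cats5"))
print(captures(STRICT_DASHES, "search/dogs-10-cats-5"))
```

With the * version, an empty Dogs value arrives in index.php as an empty string, so the script still has to handle that case.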

Resources