I have a RewriteRule inside my .htaccess file:
RewriteRule ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ /incl/pages/seo.club.php?state=$1&county=$2&title=$3 [NC,L]
In most cases it works fine; however, if the title starts with the word "club", that word is cut off.
For example, if the slug is fast-cars-club, $_GET['title'] is unchanged, as desired; however, if the slug is club-of-fast-cars, $_GET['title'] will be -of-fast-cars
In the following URL:
mysite.com/tx/travis/club/fast-cars-club
$_GET['title'] == 'fast-cars-club'
But in this URL:
mysite.com/tx/travis/club/club-fast-cars
$_GET['title'] == '-fast-cars'
What am I missing?
Your rule is too broad, so it can match strings in multiple different ways. The way that you were hoping it would match isn't necessarily the one that the regular expression engine will actually choose.
First, let's break down your pattern ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ into the parts the engine will process:
^ start of string
[a-zA-Z-/] a lower-case letter, an upper-case letter, a hyphen - or a slash /
([a-zA-Z-/]{2}) the above must match exactly 2 characters, which will be captured as $1
/ a literal slash, not optional, not captured
([a-zA-Z-/]+) the same set of characters as earlier; this time required to match one or more times (+); captured as $2
/club the literal string /club, not optional, not captured
/? a literal slash, optional (specifically, ? means must occur zero or one times)
[a-zA-Z0-9-] a lower-case letter, an upper-case letter, a digit, or a hyphen -
([a-zA-Z0-9-]+) the above must match one or more times; captured as $3
([a-zA-Z0-9-]+)? the above capture group as a whole is optional
/? a literal slash, optional
$ end of string
Next, look at how this matches a URL, starting with the one which works how you hoped (tx/travis/club/fast-cars-club, since the mysite.com/ is processed separately):
the ^ indicates that we can't throw anything away at the start of the string
tx matches ([a-zA-Z-/]{2}) and goes into $1
/ matches
([a-zA-Z-/]+) could match the whole of travis/club/fast-cars-club but this leaves nothing for the rest of the pattern to match.
The regex engine now applies "back-tracking": it tries shorter matches until it finds one that matches more of the pattern. In this case, it finds that if it takes just travis and puts it in $2, it can match the mandatory /club which comes next
/club is followed by /, so /? matches
fast-cars-club matches [a-zA-Z0-9-]+, so is captured into $3
we've used the whole input string, so $ succeeds
Now look at the "misbehaving" string, tx/travis/club/club-fast-cars:
the ^ indicates that we can't throw anything away at the start of the string
tx matches ([a-zA-Z-/]{2}) and goes into $1
/ matches
([a-zA-Z-/]+) could match the whole of travis/club/club-fast-cars but this leaves nothing for the rest of the pattern to match.
While "back-tracking", the regex engine tries putting travis/club into $2; this is followed by another /club, so the match succeeds
there is no following /, but that's fine: /? can match zero occurrences
the remainder of the string, -fast-cars matches [a-zA-Z0-9-]+, so is captured into $3
we've used the whole input string, so $ succeeds
This behaviour of "greediness" and "back-tracking" is key to understanding complex regular expressions, but most of the time the solution is simply to make the regular expression less complex and more specific.
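The two walk-throughs above can be reproduced outside Apache. The following is a minimal sketch in Python, assuming that Python's re module back-tracks the same way as the engine used by mod_rewrite for this pattern; re.IGNORECASE stands in for the [NC] flag, and the helper name captures is made up for illustration.

```python
import re

# The pattern from the question, unchanged.
# re.IGNORECASE is an assumption standing in for the [NC] flag.
ORIGINAL = re.compile(
    r"^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$",
    re.IGNORECASE,
)


def captures(path):
    """Return the ($1, $2, $3) captures, or None if the path does not match."""
    m = ORIGINAL.match(path)
    return m.groups() if m else None


print(captures("tx/travis/club/fast-cars-club"))
# ('tx', 'travis', 'fast-cars-club')
print(captures("tx/travis/club/club-fast-cars"))
# ('tx', 'travis/club', '-fast-cars')
```

Note that in the second case the county capture has silently become travis/club, which is the other half of the same problem.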
Only you know the full rules you want to specify, but as a starting point, let's make everything mandatory:
exactly two letters (the state) [a-zA-Z]{2}
/
one or more letters or hyphens (the county) [a-zA-Z-]+
/
the literal word club
/
one or more letters or hyphens (the title) [a-zA-Z-]+
/
Adding parentheses to capture the three parts gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club/([a-zA-Z-]+)/$
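As a quick check of the stricter pattern, here is a small Python sketch (Python's re is used only as a stand-in for the rewrite engine, and parse is a hypothetical helper). With no / allowed inside the character classes and no optional parts, there is only one way to match each URL.

```python
import re

# The "everything mandatory" pattern from the text above.
STRICT = re.compile(r"^([a-zA-Z]{2})/([a-zA-Z-]+)/club/([a-zA-Z-]+)/$")


def parse(path):
    """Return (state, county, title), or None if the path does not match."""
    m = STRICT.match(path)
    return m.groups() if m else None


print(parse("tx/travis/club/club-fast-cars/"))
# ('tx', 'travis', 'club-fast-cars')
print(parse("tx/travis/club/fast-cars-club/"))
# ('tx', 'travis', 'fast-cars-club')
print(parse("tx/travis/club/club-fast-cars"))
# None, because the trailing slash is mandatory in this version
```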
Now we can decide to make some parts optional, remembering that the more we make optional, the more ways there might be to re-interpret a URL.
We can probably safely make the trailing / optional. Alternatively, we can have a separate rule that matches any URL without a trailing / and redirects to a URL with it added on (this is quite common to allow both forms but reduce the number of duplicate URLs in search engines).
If we wanted to allow mysite.com/tx/travis/ in addition to mysite.com/tx/travis/club/club-fast-cars/ we could make the whole /club/([a-zA-Z-]+) section optional: ^([a-zA-Z]{2})/([a-zA-Z-]+)(/club/([a-zA-Z-]+))?/$ Note that the extra parentheses capture an extra variable, so what was $3 will now be $4.
Or maybe we want to allow mysite.com/tx/travis/club/, in which case we would make /([a-zA-Z-]+) optional - note that we want to include the / in the optional part, even though we don't want to capture it. That gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club(/([a-zA-Z-]+))?/$
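The renumbering of captures is easy to get wrong, so here is a hedged Python sketch of the two optional variants above (again using Python's re as a stand-in; the names WITH_OPTIONAL_CLUB and WITH_OPTIONAL_TITLE are invented for this example). A group that did not take part in the match comes back as None in Python.

```python
import re

# Variant 1: the whole "/club/<title>" section is optional.
WITH_OPTIONAL_CLUB = re.compile(
    r"^([a-zA-Z]{2})/([a-zA-Z-]+)(/club/([a-zA-Z-]+))?/$"
)
# Variant 2: only "/<title>" after "/club" is optional.
WITH_OPTIONAL_TITLE = re.compile(
    r"^([a-zA-Z]{2})/([a-zA-Z-]+)/club(/([a-zA-Z-]+))?/$"
)

print(WITH_OPTIONAL_CLUB.match("tx/travis/").groups())
# ('tx', 'travis', None, None)
print(WITH_OPTIONAL_CLUB.match("tx/travis/club/club-fast-cars/").groups())
# ('tx', 'travis', '/club/club-fast-cars', 'club-fast-cars')
print(WITH_OPTIONAL_TITLE.match("tx/travis/club/").groups())
# ('tx', 'travis', None, None)
print(WITH_OPTIONAL_TITLE.match("tx/travis/club/club-fast-cars/").groups())
# ('tx', 'travis', '/club-fast-cars', 'club-fast-cars')
```

In both variants the title is the fourth capture, and the third one holds the whole optional section including its slashes.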
The two things you had which we almost certainly don't want are:
Allowing / inside any of the character ranges; keep it for separating components only unless you have a really good reason to allow it elsewhere.
Making / optional in the middle; as we saw, this just leads to multiple ways of matching the same string, and makes the whole thing more complicated than it needs to be.
I have an html text. With my regex:
r'(http[\S]?://[\S]+/favicon\.ico[\S^,]+)"'
and with re.findall(), I get this result from it:
['https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196', 'https://stackoverflow.com/favicon.ico,https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196']
But I don't want the second result in the list. I understand that it has a comma inside, but I have no idea how to exclude the comma in my regex. I use re.findall() in order to find the necessary link anywhere in the HTML text, because I don't know where it could be.
Note that [\S]+ contains a redundant character class; it is the same as \S+. In http[\S]?://, the [\S]? is most likely a mistake, as it matches any optional non-whitespace char; I doubt you intended to match a protocol like http§://. Just use s to match s, or S to match S.
You can use
https?://[^\s",]*/favicon\.ico[^",]+
Details:
https?:// - http:// or https://
[^\s",]* - zero or more chars other than whitespace, " and , chars
/favicon\.ico - a fixed /favicon.ico string
[^",]+ - one or more chars other than a " and , chars.
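A short sketch of the suggested pattern in use. The HTML snippet and the example.com URLs are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# The pattern suggested above: commas and double quotes are excluded
# everywhere, so a comma-separated list can no longer be swallowed as one URL.
FAVICON = re.compile(r'https?://[^\s",]*/favicon\.ico[^",]+')

# Hypothetical input: one plain link, plus one attribute holding two
# comma-separated URLs.
html = (
    '<link rel="icon" href="https://cdn.example.com/img/favicon.ico?v=1">'
    '<meta content="https://example.com/favicon.ico,'
    'https://cdn.example.com/img/favicon.ico?v=1">'
)

print(FAVICON.findall(html))
# ['https://cdn.example.com/img/favicon.ico?v=1',
#  'https://cdn.example.com/img/favicon.ico?v=1']
```

Because the final [^",]+ requires at least one character after favicon.ico, the bare https://example.com/favicon.ico (followed directly by a comma) is skipped; change + to * if such URLs should be returned too.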
I am trying to find image files in a CSS file using Python's re.findall. The following works, except that it only finds the first image in the CSS file and ignores the rest. How do I make it grab all the image links?
img_links_in_css = re.findall('^.(url|URL|Url|uRL|uRl)\s(\s*(.+.(png|jpg|gif|jpeg|svg))\s*).*?$', str(css))
There are some problems in your expression:
The .+ and .* tokens (which are greedy quantifiers) make the regex match at the first occurrence and then consume all remaining characters of the string (especially if the CSS is minified); and
The ^ and $ tokens will only work if the CSS isn't minified (i.e. not all on one line) and if you use the multi-line flag (re.M or re.MULTILINE);
So, you could change it to (for non-minified CSS):
pattern = r'^.+(?:url|URL|Url|uRL|uRl)\s?\(\s*(.+\.(?:png|jpg|gif|jpeg|svg))\s*\).*?$'
re.findall(pattern, str(css), re.M)
To work with minified CSS you have to eliminate the .+ and .* tokens as well. A simpler expression can be used for this:
pattern = r'url\s*\(([^)]+)'
re.findall(pattern, str(css), re.I)
Where:
url: matches the letters u, r and l in any combination of upper and lower case, because of the re.I flag ([Uu][Rr][Ll] could be used instead);
\s*: optionally followed by whitespace;
\(: an opening parenthesis;
And finally, the group ([^)]+), which captures one or more characters other than ).
Example:
>>> css = 'body{background-attachment:fixed;background-image:uRl(./Images/bg4.png)}.img-default{background-image:Url(./images/def.jpg)}div#header{\nbackground-image:url(images/header-background.jpg)\n}'
>>> re.findall(r'url\(([^)]+)', css, re.I)
['./Images/bg4.png', './images/def.jpg', 'images/header-background.jpg']
In your regex, ^ matches the start of a line (or of the entire file) and $ matches the end. Therefore your regex matches the entire file (because of the .* at the end), and you get only one (non-overlapping) match.
Instead you should search for the following:
r'(url|URL|Url|uRL|uRl)\s(\s*(.+?\.(png|jpg|gif|jpeg|svg))\s*)'
The changes are
removing ^.* and .*$ at beginning and end.
.+? instead of .+ to make it non-greedy (matching the smallest possible string)
searching for an actual "." should be done with \. or [.]
Note that the \s* is not necessary and that \s\s* can be replaced with \s+ if it's not a matter of capturing-groups.
Also take care over which groups you want. Each (...) is a capturing group that can be accessed afterwards; for non-capturing groups use (?:...).
Maybe like this (depending on which parts you want):
r'(?:url|URL|Url|uRL|uRl)\s\s*.+?\.(?:png|jpg|gif|jpeg|svg)'
or
r'(?:url|URL|Url|uRL|uRl)\s\s*(.+?)\.(?:png|jpg|gif|jpeg|svg)'
for capturing only the part inside (in Python, a captured group is read with match.group(1), or referenced as \g<1> in a re.sub replacement string, if you need to process it).
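To make the two ways of reaching a captured group concrete, here is a minimal Python sketch. The CSS string, the /static/ prefix and the pattern are assumptions chosen for the example (the pattern follows the url\s*\( idea rather than the alternation shown above).

```python
import re

# Hypothetical CSS and pattern, for illustration only.
css = "a{background:url(images/bg.png)} b{background:URL( icons/logo.svg )}"
pattern = re.compile(r"url\s*\(\s*([^)\s]+)", re.IGNORECASE)

# 1) Reading group 1 from each match object.
paths = [m.group(1) for m in pattern.finditer(css)]
print(paths)
# ['images/bg.png', 'icons/logo.svg']

# 2) Referring to group 1 inside a replacement string with \g<1>.
rewritten = pattern.sub(r"url(/static/\g<1>", css)
print(rewritten)
# a{background:url(/static/images/bg.png)} b{background:url(/static/icons/logo.svg )}
```

With exactly one capturing group, re.findall returns the same list as the finditer loop; the match objects are mainly useful when several groups or positions are needed.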
I am trying to understand the meaning of this line in the .htaccess file
RewriteRule ([a-z0-9/-]+).html $1.php [NC,L,QSA]
Basically, what does $1.php mean? Which file on the server does it refer to?
If we have home.html, where will this redirect to? home.php?
$1 is the first captured group from your regular expression; that is, the contents between ( and ). If you had a second set of parentheses in your regex, $2 would contain the contents of those parens. Here is an example:
RewriteRule ([a-z0-9/-]+)-([a-z]+).html$ $1-$2.php [NC,L,QSA]
Say a user navigates to hello-there.html. They would be served hello-there.php. In your substitution string, $1 contains the contents of the first set of parens (hello), while $2 contains the contents of the second set (there). There will always be exactly as many "dollar" values available in your substitution string as there are sets of capturing parentheses in your regex.
If you have nested parens, say, (([a-z]+)-[a-z]+), $1 always refers to the outermost capture (in this case the whole regex), $2 is the first nested set, and so on.
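Python's re module numbers groups the same way (by the position of the opening parenthesis), so the numbering can be tried out quickly there. This is only a sketch of the numbering rule, not of mod_rewrite itself.

```python
import re

# Two side-by-side groups, as in the hello-there.html example above.
flat = re.search(r"([a-z0-9/-]+)-([a-z]+).html$", "hello-there.html")
print(flat.group(1), flat.group(2))
# hello there

# Nested groups: numbering follows the opening parentheses, left to right.
nested = re.match(r"(([a-z]+)-[a-z]+)", "hello-there")
print(nested.group(1))
# hello-there  (the outer group)
print(nested.group(2))
# hello        (the first nested group)
```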
.htaccess files can contain a wide variety of Apache configuration directives, but this one, like many, is to do with the URL rewriting module, mod_rewrite.
A RewriteRule directive has 3 parts:
a Pattern (regular expression) which needs to match against the current URL
a Substitution string, representing the URL to serve instead, or instruct the browser to redirect to
an optional set of flags
In this case, you have a regular expression which matches anything ending in .html which consists only of letters a-z, digits 0-9, / and -. However, it also contains a set of parentheses (...), which mark a part of the pattern to be "captured".
The Substitution string can then reference this "captured" value; the first capture is $1, and the second would be $2, and so on.
In this case, the captured part is everything before the .html, and the Substitution is $1.php, meaning whatever string came before .html is kept, but the .html is thrown away and .php is stuck on instead.
So for your specific example, accessing home.html will instead act as though you had requested home.php.
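The Pattern / Substitution mechanics can be imitated in a few lines of Python. This is a simplified model under stated assumptions: it ignores the L and QSA flags, treats [NC] as re.IGNORECASE, and the function name rewrite is invented for the example.

```python
import re

# Pattern from the rule in the question; [NC] is modelled with re.IGNORECASE.
RULE = re.compile(r"([a-z0-9/-]+).html", re.IGNORECASE)


def rewrite(path):
    """Return the substituted path if the rule matches, else the path unchanged."""
    m = RULE.search(path)
    if m is None:
        return path
    # \1 plays the role of $1 in the Substitution string "$1.php".
    return m.expand(r"\1.php")


print(rewrite("home.html"))
# home.php
print(rewrite("docs/intro.html"))
# docs/intro.php
print(rewrite("style.css"))
# style.css  (no match, so nothing changes)
```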
It's a reference to the first capture group denoted by the parentheses in the pattern ([a-z0-9/-]+).html$. If there were two (.*)-(.*) then you would access $1 for the first capture group and $2 for the second, etc...
$1 refers to the first group captured by your regex (i.e. the part between the parentheses). In your case it refers to:
([a-z0-9/-]+)
For the URL mypage.html, $1 will contain "mypage", and the rule will rewrite the request to mypage.php.
Is it possible to use a backreference in the middle of the RewriteRule pattern?
I envision this as such:
RewriteRule ^(.)([a-z]*)$1([a-z]*) $2-$3
^
(note, this fellow, here, refers to that first "(.)")
If the received url is XboopXdoop, the result would of course be boop-doop.
I am attempting to use this to specify a delimiter at the beginning of the incoming url that can be used to parse the rest of the string, without forcing the use of a specific character as that delimiter.
Thank you.
$1 works on the right side (rewrite), but not in the regex. You need to use \1.
Try:
RewriteRule ^(.)([a-z]+)\1([a-z]+) $2-$3
I ran into a bizarre edge case with the *, where it split based on the second character of the string and not the first: XtestingXtest resulted in es-ing. I'm not sure what was happening there, but if I use a + it works fine.
Also, since * and + are greedy, if you have multiple delimiter characters, it will split on the last occurrence of the character:
XbaseXtest -> base-test
XbaseXteXst -> baseXte-st
XbaseXtestX -> baseXtest-
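The in-pattern backreference and the effect of greediness can be tried in Python, where \1 inside the pattern has the same meaning. One assumption to flag: the examples above only come out this way if the upper-case X can also be matched by [a-z], so this sketch enables case-insensitive matching; split_on_delimiter is a hypothetical helper name.

```python
import re

# Pattern from the rule above. re.IGNORECASE is an assumption: it lets the
# upper-case delimiter "X" also be consumed by [a-z]+, as the examples imply.
RULE = re.compile(r"^(.)([a-z]+)\1([a-z]+)", re.IGNORECASE)


def split_on_delimiter(s):
    """Return '$2-$3' for a matching string, or None when there is no match."""
    m = RULE.match(s)
    return f"{m.group(2)}-{m.group(3)}" if m else None


print(split_on_delimiter("XboopXdoop"))
# boop-doop
print(split_on_delimiter("XbaseXtest"))
# base-test
print(split_on_delimiter("XbaseXteXst"))
# baseXte-st  (the greedy $2 keeps everything up to the LAST delimiter)
```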
I have a string like hello /world today/
I need to replace /world today/ with /MY NEW STRING/
Reading the manual I have found
newString = string.match("hello /world today/","%b//")
which I can use with gsub to do the replacement, but I wondered whether there is also an elegant way to return just the text between the slashes. I know I could just trim it, but I wondered if there was a pattern for it.
Try something like one of the following:
slashed_text = string.match("hello /world today/", "/([^/]*)/")
slashed_text = string.match("hello /world today/", "/(.-)/")
slashed_text = string.match("hello /world today/", "/(.*)/")
This works because string.match returns any captures from the pattern, or the entire matched text if there are no captures. The key then is to make sure that the pattern has the right amount of greediness, remembering that Lua patterns are not a complete regular expression language.
The first two should match the same text. In the first, I've expressly required that the pattern match as many non-slashes as possible. The second (thanks lhf) matches the shortest span of any characters at all followed by a slash. The third is greedier: it matches the longest span of characters that can still be followed by a slash.
The %b// in the original question doesn't have any advantages over /.-/ since the two delimiters are the same character.
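The difference between the three patterns only shows up when the text contains more than two slashes. The sketch below illustrates this with Python's re module rather than Lua patterns, so it is an analogy, not Lua code: Lua's .- corresponds roughly to .*? here, and the longer sample string is made up.

```python
import re

# Hypothetical sample with two slash-delimited sections.
text = "hello /world today/ and /more/"

print(re.search(r"/([^/]*)/", text).group(1))
# world today            (as many non-slashes as possible)
print(re.search(r"/(.*?)/", text).group(1))
# world today            (shortest span followed by a slash)
print(re.search(r"/(.*)/", text).group(1))
# world today/ and /more (longest span followed by a slash)

# The replacement asked for in the question, in the same Python analogy.
print(re.sub(r"/[^/]*/", "/MY NEW STRING/", "hello /world today/"))
# hello /MY NEW STRING/
```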
Edit: Added a pattern suggested by lhf, and more explanations.