Python findall() RE - python-3.x

I am trying to find image files in a css file using Python re find all. The following works except it only finds the first image in the CSS file and ignores the rest. How do I make it to grab all image links?
img_links_in_css = re.findall('^.(url|URL|Url|uRL|uRl)\s(\s*(.+.(png|jpg|gif|jpeg|svg))\s*).*?$', str(css))

There are some problems in your expression:
The .+ and .* tokens (wich are greedy quantifiers) makes the RegEx match the first occurence and then capture all remaining characters of the string (especially if the CSS is minified); and
The tokens ^ and $ will only if the CSS isn't minified (all in one line) and if you use the multi-line flag (re.Mor re.MULTILINE);
So, you could change it to (for non-minified CSS):
pattern = r'^.+(?:uRl|URL|Url|uRL|Uri)\s?(\s*(?:.+.(?:png|jpg|gif|jpeg|svg))\s*).*?$'
re.findall(pattern, str(css), re.M)
To work with minified CSS you have to eliminate the .+ and .* tokens as well. A simplier expression can be used for this:
pattern = r'url\s*\(([^)]+)'
re.findall(pattern, str(css), re.I)
Where:
url\*: matches any combination of the letters U, R and L, modified by the re.I flag to ignore cases. ([Uu][Rr][Ll] could be used instead);
\s*: preceding or not by whitespaces;
\(: an open parentheses;
And finnaly, the group ([^)]+) matching any character different than ).
Example:
>>> css = 'body{background-attachment:fixed;background-image:uRl(./Images/bg4.png)}.img-default{background-image:Url(./images/def.jpg)}div#header{\nbackground-image:url(images/header-background.jpg)\n}'
>>> re.findall(r'url\(([^)]+)', css, re.I)
['./Images/bg4.png', './images/def.jpg', 'images/header-background.jpg']

in your regex ^ matches the start of a new line (or the entire file) and $ matches the end. Therefor your regex matches the entire file (because of the .* at the end) and you have only one (non-overlapping) match.
Instead you should search for the following:
r'(url|URL|Url|uRL|uRl)\s(\s*(.+?\.(png|jpg|gif|jpeg|svg))\s*)'
The changes are
removing ^.* and .*$ at beginning and end.
.+? instead of .+ for making int non-ambiguous (matching the smallest possible string)
searching for an actual "." should be done with \. or [.]
Note that the \s* is not necessary and that \s\s* can be replaced with \s+ if it's not a matter of capturing-groups.
Also take care of what groups you want. Each (...) is a group that can be accessed for non-capturing groups use (?:...).
Maybe like this (depending of which parts you want):
r'(?:url|URL|Url|uRL|uRl)\s\s*.+?\.(?:png|jpg|gif|jpeg|svg)'
or
r'(?:url|URL|Url|uRL|uRl)\s\s*(.+?)\.(?:png|jpg|gif|jpeg|svg)'
for capturing only the part inside (in Python these capturing-groups are accessed with \g<1> if you need to process them).

Related

Is it possible to put a line break into a 'mailto:' rewrite in htaccess?

For complex reasons I've had to remove an enquiry form from a web site and use a 'mailto:' instead. For simplicity I've changed the htaccess file so that the former 'contact' link to the form now becomes a 'mailto:' as follows:
RewriteRule ^contact$ mailto:myname#mydomain.com?subject=BusinessName\ BandB\ Enquiry&body=You\ can\ find\ our\ availability\ on\ line.\ Delete\ this\ content\ if\ inapplicable
That does work, my local e-mail client (Thunderbird) opens with the information correctly shown in subject and body. (My TB is set to compose in plain text, I've yet to test with HTML)
I would like to introduce a new line in the body so that 'Delete this content if inapplicable' is on a separate line. Is there any way to do this? Given mod_rewrite's intended purpose I could understand if there isn't but I thought I'd ask before giving up.
I would like to introduce a new line in the body so that 'Delete this content if inapplicable' is on a separate line.
New lines in the body need are represented by two characters: carriage return (char 13) + line feed (char 10) (see RFC2368). This would need to be URL encoded in the resulting URL as %0D%0A.
When used in the RewriteRule substitution string the literal % characters would need to backslash-escaped to negate their special meaning as a backreference to the preceding CondPattern (which there isn't one). ie. \%0D\%0A. Otherwise, you will end up with the string DA, because there is no %0 backreference in this example.
You can also avoid having to backslash-escape all the literal spaces by encloses the entire argument (substitution string) in double quotes.
So, try the following instead:
RewriteRule ^contact$ "mailto:myname#mydomain.com?subject=BusinessName BandB Enquiry&body=You can find our availability on line.\%0D\%0ADelete this content if inapplicable" [R,L]

Cant create correct regex

I have an html text. With my regex:
r'(http[\S]?://[\S]+/favicon\.ico[\S^,]+)"'
and with re.findall(), I get this result from it:
['https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196', 'https://stackoverflow.com/favicon.ico,https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196']
But i dont want this second result in list, i understand that it has coma inside, but i have no idea how to exclude coma from my regex. I use re.findall() in order to find necessery link in any place in html text because i dont know where it could be.
Note that [\S]+ contains redundant character class, it is the same as \S+. In http[\S]?://, [\S]? is most likely a human error, as [\S]? matches any optional non-whitespace char. I doubt you implied to match http§:// protocol. Just use s to match s, or S to match S.
You can use
https?://[^\s",]*/favicon\.ico[^",]+
See the regex demo.
Details:
https?:// - http:// or https://
[^\s",]* - zero or more chars other than whitespace, " and , chars
/favicon\.ico - a fixed /favicon.ico string
[^",]+ - one or more chars other than a " and , chars.

RewriteRule cuts off part of a variable name

I have a RewriteRule inside my .htaccess file:
RewriteRule ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ /incl/pages/seo.club.php?state=$1&county=$2&title=$3 [NC,L]
For most cases it works fine, however, if the title starts with the word "club" that word is cut off.
For example, if the name is fast-cars-club the $_GET['title'] will be unchanged, as desired, however if the slug is club-of-fast-cars the $_GET['title'] will output -of-fast-cars
In the following URL:
mysite.com/tx/travis/club/fast-cars-club
$_GET['title'] == 'fast-cars-club'
But in the this URL:
mysite.com/tx/travis/club/club-fast-cars
$_GET['title'] == '-fast-cars'
What am I missing?
Your rule is too broad, so it can match strings in multiple different ways. The way that you were hoping it would match isn't necessarily the one that the regular expression engine will actually process.
First, let's break down your pattern ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ into the parts the engine will process:
^ start of string
[a-zA-Z-/] a lower-case letter, an upper-case letter, a hyphen - or a slash /
([a-zA-Z-/]{2}) the above must match exactly 2 characters, which will be captured as $1
/ a literal slash, not optional, not captured
([a-zA-Z-/]+) the same set of characters as earlier; this time required to match one or more times (+); captured as $2
/club the literal string /club, not optional, not captured
/? a literal slash, optional (specifically, ? means must occur zero or one times)
[a-zA-Z0-9-] a lower-case letter, an upper-case letter, a digit, or a hyphen -
([a-zA-Z0-9-]+) the above must match one or more times; captured as $3
([a-zA-Z0-9-]+)? the above capture group as a whole is optional
/? a literal slash, optional
$ end of string
Next, look at how this matches a URL, starting with the one which works how you hoped (tx/travis/club/fast-cars-club, since the mysite.com/ is processed separately):
the ^ indicates that we can't throw anything away at the start of the string
tx matches ([a-zA-Z-/]{2}) and goes into $1
/ matches
([a-zA-Z-/]+) could match the whole of travis/club/fast-cars-club but this leaves nothing for the rest of the pattern to match.
The regex engine now applies "back-tracking": it tries shorter matches until it finds one that matches more of the pattern. In this case, it finds that if it takes just travis and puts it in $2, it can match the mandatory /club which comes next
/club is followed by /, so /? matches
fast-cars-club matches [a-zA-Z0-9-]+, so is captured into $3
we've used the whole input string, so $ succeeds
Now look at the "misbehaving" string, tx/travis/club/club-fast-cars:
the ^ indicates that we can't throw anything away at the start of the string
tx matches ([a-zA-Z-/]{2}) and goes into $1
/ matches
([a-zA-Z-/]+) could match the whole of travis/club/club-fast-cars but this leaves nothing for the rest of the pattern to match.
While "back-tracking", the regex engine tries putting travis/club into $2; this is followed by another /club, so the match succeeds
there is no following /, but that's fine: /? can match zero occurrences
the remainder of the string, -fast-cars matches [a-zA-Z0-9-]+, so is captured into $3
we've used the whole input string, so $ succeeds
This behaviour of "greediness" and "back-tracking" is a key one to understanding complex regular expressions, but most of the time the solution is simply to make the regular expression less complex, and more specific.
Only you know the full rules you want to specify, but as a starting point, let's make everything mandatory:
exactly two letters (the state) [a-zA-Z]{2}
/
one or more letters or hyphens (the county) [a-zA-Z-]+
/
the literal word club
/
one or more letters or hyphens (the title) [a-zA-Z-]+
/
Adding parentheses to capture the three parts gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club/([a-zA-Z-]+)/$
Now we can decide to make some parts optional, remembering that the more we make optional, the more ways there might be to re-interpret a URL.
We can probably safely make the trailing / optional. Alternatively, we can have a separate rule that matches any URL without a trailing / and redirects to a URL with it added on (this is quite common to allow both forms but reduce the number of duplicate URLs in search engines).
If we wanted to allow mysite.com/tx/travis/ in addition to mysite.com/tx/travis/club/club-fast-cars/ we could make the whole /club/([a-zA-Z-]+) section optional: ^([a-zA-Z]{2})/([a-zA-Z-]+)(/club/([a-zA-Z-]+))?/$ Note that the extra parentheses capture an extra variable, so what was $3 will now be $4.
Or maybe we want to allow mysite.com/tx/travis/club/, in which case we would make /([a-zA-Z-]+) optional - note that we want to include the / in the optional part, even though we don't want to capture it. That gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club(/([a-zA-Z-]+))?/$
The two things we almost certainly don't want, which you had are:
Allowing / inside any of the character ranges; keep it for separating components only unless you have a really good reason to allow it elsewhere.
Making / optional in the middle; as we saw, this just leads to multiple ways of matching the same string, and makes the whole thing more complicated than it needs to be.

mod_rewrite RewriteRule backreference use in pattern

Is it possible to use a backreference in the middle of the RewriteRule pattern?
I envision this as such:
RewriteRule ^(.)([a-z]*)$1([a-z]*) $2-$3
^
(note, this fellow, here, refers to that first "(.)")
If the received url is XboopXdoop, the result would of course be boop-doop.
I am attempting to use this to specify a delimiter at the beginning of the incoming url that can be used to parse the rest of the string, without forcing the use of a specific character as that delimiter.
Thank you.
$1 works on the right side (rewrite), but not in the regex. You need to use \1.
Try:
RewriteRule ^(.)([a-z]+)\1([a-z]+) $2-$3
I ran into a bizarre edge case with the * where it split based on the second character of the string, and not the second. XtestingXtest resulted in es-ing ... so yeah, not sure what was happening there. If I use a + it works fine.
Also, since * and + are greedy, if you have multiple delimiter characters, it will split on the last occurrence of the character:
XbaseXtest -> base-test
XbaseXteXst -> baseXte-st
XbaseXtestX -> baseXtest-

Lua Pattern for extracting/replacing value in / /

I have a string like hello /world today/
I need to replace /world today/ with /MY NEW STRING/
Reading the manual I have found
newString = string.match("hello /world today/","%b//")
which I can use with gsub to replace, but I wondered is there also an elegant way to return just the text between the /, I know I could just trim it, but I wondered if there was a pattern.
Try something like one of the following:
slashed_text = string.match("hello /world today/", "/([^/]*)/")
slashed_text = string.match("hello /world today/", "/(.-)/")
slashed_text = string.match("hello /world today/", "/(.*)/")
This works because string.match returns any captures from the pattern, or the entire matched text if there are no captures. The key then is to make sure that the pattern has the right amount of greediness, remembering that Lua patterns are not a complete regular expression language.
The first two should match the same texts. In the first, I've expressly required that the pattern match as many non-slashes as possible. The second (thanks lhf) matches the shortest span of any characters at all followed by a slash. The third is greedier, it matches the longest span of characters that can still be followed by a slash.
The %b// in the original question doesn't have any advantages over /.-/ since the the two delimiters are the same character.
Edit: Added a pattern suggested by lhf, and more explanations.

Resources