I have some HTML text. With this regex:
r'(http[\S]?://[\S]+/favicon\.ico[\S^,]+)"'
and with re.findall(), I get this result from it:
['https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196', 'https://stackoverflow.com/favicon.ico,https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196']
But I don't want the second result in the list. I understand that it contains a comma, but I have no idea how to exclude commas in my regex. I use re.findall() to find the necessary link anywhere in the HTML text, because I don't know where it might be.
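Note that [\S]+ contains a redundant character class; it is the same as \S+. In http[\S]?://, the [\S]? is most likely a mistake, as it matches any optional non-whitespace char. I doubt you intended to match something like http§://. Just use s to match s, or S to match S. Likewise, ^ negates a character class only when it is the first character inside the brackets, so [\S^,] still matches commas (it is effectively just \S).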
You can use
https?://[^\s",]*/favicon\.ico[^",]+
Details:
https?:// - http:// or https://
[^\s",]* - zero or more chars other than whitespace, " and , chars
/favicon\.ico - a fixed /favicon.ico string
[^",]+ - one or more chars other than a " and , chars.
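A quick way to check the suggested pattern from Python is sketched below. The html string is a made-up sample (an assumption, not the asker's real page) that contains both a single link and a comma-separated attribute value.

```python
import re

# Hypothetical sample markup; the real page is not shown in the question.
html = (
    '<link rel="shortcut icon" '
    'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196">'
    '<meta content="https://stackoverflow.com/favicon.ico,'
    'https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196">'
)

# The pattern has no capturing group, so findall returns each whole match.
FAVICON = re.compile(r'https?://[^\s",]*/favicon\.ico[^",]+')

links = FAVICON.findall(html)
# Drop duplicates while keeping first-seen order.
unique_links = list(dict.fromkeys(links))
print(unique_links)
```

Note that the bare https://stackoverflow.com/favicon.ico is skipped here, because the final [^",]+ requires at least one character after favicon.ico.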
How can I remove all characters inside angle brackets, including the brackets, in a string? How can I also remove all the text between "\r\n" and "." plus any 3 characters? Is this possible? I am currently using the solution by #xkcdjerry.
e.g.
body = """Dear Students roads etc. you place a tree take a snapshot, then when you place a\r\nbuilding, take a snapshot. Place at least 5-6 objects and then have 5-6\r\nsnapshots. Please keep these snapshots with you as everyone will be asked\r\nto share them during the class.\r\n\r\nI am attaching one PowerPoint containing instructions and one video of\r\nexplanation for your reference.\r\n\r\nKind regards,\r\nTeacher Name\r\n zoom_0.mp4\r\n<https://drive.google.com/file/d/1UX-klOfVhbefvbhZvIWijaBdQuLgh_-Uru4_1QTkth/view?usp=drive_web>"""
d = re.compile("\r\n.+?\\....")
body = d.sub('', body)
a = re.compile("<.*?>")
body = a.sub('', body)
print(body)
For some reason the output is fine except that it has:
gle.com/file/d/1UX-klOfVhbefvbhZvIWijaBdQuLgh_-Uru4_1QTkth/view?usp=drive_web>
randomly attached to the end. How can I fix it?
Answer
Your problem can be solved by a regex:
Put this into the shell:
import re
a=re.compile("<.*?>")
a.sub('',"Keep this part of the string< Remove this part>Keep This part as well")
Output:
'Keep this part of the stringKeep This part as well'
Second question:
import re
a=re.compile("\r\n.*?\\..{3}")
a.sub('',"Hello\r\nFilename.png")
Output:
'Hello'
Breakdown
Regex is a robust way of finding, replacing, and mutating small strings inside bigger ones. For further reading, consult https://docs.python.org/3/library/re.html. Meanwhile, here is a breakdown of the regex syntax used in this answer:
. means any char (except a newline, by default).
*? means as many of the preceding item as needed, but as few as possible (a non-greedy match).
So .*? means any number of characters, but as few as possible.
Note: The reason there is a \\. in the second regex is that a . in the match needs to be escaped by a \, which in turn needs to be escaped as \\ inside a regular Python string.
The methods:
re.compile(pattern: str) compiles a regex for further use.
regex.sub(repl: str, string: str) replaces every match of regex in string with repl.
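For the text in the question, the order of the two substitutions matters. If the filename pattern runs first, its .+? can run from a \r\n through the opening < of the link up to the first dot plus three characters, which leaves the tail of the URL behind. A minimal sketch, assuming it is acceptable to strip the bracketed link first and to use a stricter filename pattern (FILE_LINE, clean, and the example.com URL are hypothetical, not part of the original answer):

```python
import re

ANGLE = re.compile(r"<.*?>")
# Assumed stricter form: a line break, then a name without dots or line
# breaks, a dot, and exactly three word characters that end the line.
FILE_LINE = re.compile(r"\r\n[^\r\n.]*\.\w{3}(?=\r\n|$)")

def clean(text: str) -> str:
    text = ANGLE.sub("", text)      # remove <...> first, so the link is gone
    return FILE_LINE.sub("", text)  # then remove the "\r\n name.ext" lines

sample = (
    "Kind regards,\r\nTeacher Name\r\n zoom_0.mp4\r\n"
    "<https://example.com/file/view?usp=drive_web>"
)
print(repr(clean(sample)))
```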
Hope it helps.
I have a RewriteRule inside my .htaccess file:
RewriteRule ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ /incl/pages/seo.club.php?state=$1&county=$2&title=$3 [NC,L]
In most cases it works fine; however, if the title starts with the word "club", that word is cut off.
For example, if the slug is fast-cars-club, $_GET['title'] will be unchanged, as desired; however, if the slug is club-of-fast-cars, $_GET['title'] will be -of-fast-cars.
In the following URL:
mysite.com/tx/travis/club/fast-cars-club
$_GET['title'] == 'fast-cars-club'
But in this URL:
mysite.com/tx/travis/club/club-fast-cars
$_GET['title'] == '-fast-cars'
What am I missing?
Your rule is too broad, so it can match strings in multiple different ways. The way that you were hoping it would match isn't necessarily the one that the regular expression engine will actually choose.
First, let's break down your pattern ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ into the parts the engine will process:
^ start of string
[a-zA-Z-/] a lower-case letter, an upper-case letter, a hyphen - or a slash /
([a-zA-Z-/]{2}) the above must match exactly 2 characters, which will be captured as $1
/ a literal slash, not optional, not captured
([a-zA-Z-/]+) the same set of characters as earlier; this time required to match one or more times (+); captured as $2
/club the literal string /club, not optional, not captured
/? a literal slash, optional (specifically, ? means must occur zero or one times)
[a-zA-Z0-9-] a lower-case letter, an upper-case letter, a digit, or a hyphen -
([a-zA-Z0-9-]+) the above must match one or more times; captured as $3
([a-zA-Z0-9-]+)? the above capture group as a whole is optional
/? a literal slash, optional
$ end of string
Next, look at how this matches a URL, starting with the one which works how you hoped (tx/travis/club/fast-cars-club, since the mysite.com/ is processed separately):
the ^ indicates that we can't throw anything away at the start of the string
tx matches ([a-zA-Z-/]{2}) and goes into $1
/ matches
([a-zA-Z-/]+) could match the whole of travis/club/fast-cars-club but this leaves nothing for the rest of the pattern to match.
The regex engine now applies "back-tracking": it tries shorter matches until it finds one that matches more of the pattern. In this case, it finds that if it takes just travis and puts it in $2, it can match the mandatory /club which comes next
/club is followed by /, so /? matches
fast-cars-club matches [a-zA-Z0-9-]+, so is captured into $3
we've used the whole input string, so $ succeeds
Now look at the "misbehaving" string, tx/travis/club/club-fast-cars:
the ^ indicates that we can't throw anything away at the start of the string
tx matches ([a-zA-Z-/]{2}) and goes into $1
/ matches
([a-zA-Z-/]+) could match the whole of travis/club/club-fast-cars but this leaves nothing for the rest of the pattern to match.
While "back-tracking", the regex engine tries putting travis/club into $2; this is followed by another /club, so the match succeeds
there is no following /, but that's fine: /? can match zero occurrences
the remainder of the string, -fast-cars matches [a-zA-Z0-9-]+, so is captured into $3
we've used the whole input string, so $ succeeds
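The two walk-throughs above can be reproduced outside Apache. The sketch below uses Python's re module as a stand-in for mod_rewrite's regex engine (an assumption: both backtrack the same way for this pattern), with re.IGNORECASE playing the role of the NC flag. The helper name captures is hypothetical.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(
    r"^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$",
    re.IGNORECASE,
)

def captures(path):
    """Return the three captured groups, or None if the path does not match."""
    m = ORIGINAL.match(path)
    return m.groups() if m else None

good = captures("tx/travis/club/fast-cars-club")
bad = captures("tx/travis/club/club-fast-cars")
print(good)  # ('tx', 'travis', 'fast-cars-club')
print(bad)   # ('tx', 'travis/club', '-fast-cars')
```

In the second case the greedy second group keeps travis/club, which is why the title loses its leading "club".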
This behaviour of "greediness" and "back-tracking" is key to understanding complex regular expressions, but most of the time the solution is simply to make the regular expression less complex and more specific.
Only you know the full rules you want to specify, but as a starting point, let's make everything mandatory:
exactly two letters (the state) [a-zA-Z]{2}
/
one or more letters or hyphens (the county) [a-zA-Z-]+
/
the literal word club
/
one or more letters or hyphens (the title) [a-zA-Z-]+
/
Adding parentheses to capture the three parts gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club/([a-zA-Z-]+)/$
Now we can decide to make some parts optional, remembering that the more we make optional, the more ways there might be to re-interpret a URL.
We can probably safely make the trailing / optional. Alternatively, we can have a separate rule that matches any URL without a trailing / and redirects to a URL with it added on (this is quite common to allow both forms but reduce the number of duplicate URLs in search engines).
If we wanted to allow mysite.com/tx/travis/ in addition to mysite.com/tx/travis/club/club-fast-cars/ we could make the whole /club/([a-zA-Z-]+) section optional: ^([a-zA-Z]{2})/([a-zA-Z-]+)(/club/([a-zA-Z-]+))?/$ Note that the extra parentheses capture an extra variable, so what was $3 will now be $4.
Or maybe we want to allow mysite.com/tx/travis/club/, in which case we would make /([a-zA-Z-]+) optional - note that we want to include the / in the optional part, even though we don't want to capture it. That gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club(/([a-zA-Z-]+))?/$
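The stricter pattern with an optional title can be checked the same way, again using Python's re as a stand-in for Apache's engine. The sketch shows how the extra parentheses shift the group numbers. The (?:...) variant is an addition not in the text above: a non-capturing group, which PCRE also supports, keeps the title in the third group.

```python
import re

# Optional "/title" part wrapped in an extra capturing group.
NESTED = re.compile(r"^([a-zA-Z]{2})/([a-zA-Z-]+)/club(/([a-zA-Z-]+))?/$")
# Same idea, but the wrapper group does not capture.
NON_CAPTURING = re.compile(r"^([a-zA-Z]{2})/([a-zA-Z-]+)/club(?:/([a-zA-Z-]+))?/$")

with_title = "tx/travis/club/club-fast-cars/"
without_title = "tx/travis/club/"

nested_with = NESTED.match(with_title).groups()
nested_without = NESTED.match(without_title).groups()
flat_with = NON_CAPTURING.match(with_title).groups()

print(nested_with)     # ('tx', 'travis', '/club-fast-cars', 'club-fast-cars')
print(nested_without)  # ('tx', 'travis', None, None)
print(flat_with)       # ('tx', 'travis', 'club-fast-cars')
```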
The two things we almost certainly don't want, which you had, are:
Allowing / inside any of the character ranges; keep it for separating components only unless you have a really good reason to allow it elsewhere.
Making / optional in the middle; as we saw, this just leads to multiple ways of matching the same string, and makes the whole thing more complicated than it needs to be.
I have this for example:
<#445288012218368010>
And I want to get the value between the <# and > symbols.
I tried so:
string.replace(/^(?:\<\#)(?:.*)(?:\>)$/gim, '');
But then I don't get any result; it removes the whole string.
I want only this part: 445288012218368010 (it is dynamic, so the numbers will not always be the same).
This is for a Discord chat bot. I know there are other methods for checking mentioned names, but I want to do this with a regex because what I am trying to do can't use the common method.
So how can I get the value from between those symbols?
I need this as a Node.js regex.
You can use String#match, which returns the regular expression matches for the string. In this case the RegExp would be <#(\d+)> (the parentheses around \d+ make it its own group). This way you can use <string>.match(/<#(\d+)>/) to get the match results and <string>.match(/<#(\d+)>/)[1] to get the first group of the regex (in this case the number).
Your regex matches, but you use a non-capturing group (?:.*), so you get the full match and replace it with an empty string. Note that you could omit the first and third non-capturing groups and use <# and > instead.
You could match what is between the brackets using a capturing group, ([^>]+) or (\d+), and use replace, referring to the first capturing group as $1 in the replacement.
console.log("<#445288012218368010>".replace(/^<#([^>]+)>$/gim, '$1'));
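For comparison, here is the same capture-group idea sketched in Python rather than Node.js (the function name mention_id and the surrounding sample sentence are hypothetical). The ID is kept as a string rather than converted to a number.

```python
import re

MENTION = re.compile(r"<#(\d+)>")

def mention_id(text):
    """Return the digits inside the first <#...> mention, or None."""
    m = MENTION.search(text)
    return m.group(1) if m else None

extracted = mention_id("<#445288012218368010>")
# Replace each mention with its first group, like '$1' in the JavaScript call.
unwrapped = MENTION.sub(r"\1", "see <#445288012218368010> please")

print(extracted)
print(unwrapped)
```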
I wanted to correct the automatically created Linux scripts. I use the replaceAll(String, String) function to replace "$APP_ARGS" with something else.
I have tried variants:
replaceAll('"$APP_ARGS"', 'simulators ' + '"\\\\$APP_ARGS"') - doesn't find
replaceAll('\"\$APP_ARGS\"',... - doesn't find
replaceAll('"\$APP_ARGS"',... - doesn't find
replaceAll('\\"\\$APP_ARGS\\"',... - editor warning - excessive escape
replaceAll('"\\\\$APP_ARGS"',... - doesn't find
replaceAll('\\\\"\\\\$APP_ARGS\\\\"',... - doesn't find
replaceAll($/"$$APP_ARGS"/$, ...) - does not find
replaceAll('"[$]APP_ARGS"', 'something simple') - finds.
replaceAll('"[$]APP_ARGS"', '"\\\\$APP_ARGS"') - fails.
As you see, if I use the regex format, the finding works ok. But is there a way to make escaping work? I need that $ in the replacement string, too.
According to the Groovy manuals, a /../ string doesn't need escaping for anything except the slashes themselves. But
replaceAll(/"$APP_ARGS"/,...
fails, too, with a message: Could not get unknown property 'APP_ARGS'.
It seems that the behaviour of that function has no logic, and we have to find the correct solution by experiment:
replaceAll('"\\$APP_ARGS"', 'simulators ' + '"\\$APP_ARGS"')
An additional possible problem is that the \\ before $ has to be in both strings, the pattern and the replacement.
The first argument of replaceAll is always treated as a regexp, so we need to quote $ (which otherwise means end of line). The second param may contain backreferences to groups from the regexp, which start with a $, so that one must be quoted too.
A saner way is to use replace instead of replaceAll, which already quotes/escapes both params for that usage.
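The same pitfall exists in other regex APIs. Below is a rough Python analogue, not Groovy: Python's replacement strings use backslash escapes rather than $ for group references, so only the pattern side carries over directly. The script line is a hypothetical example.

```python
import re

# Hypothetical line from a generated start script.
script = 'exec "$JAVACMD" "$APP_ARGS"'

# Unescaped: $ is an end-of-string anchor, so this pattern can never match.
unescaped = re.sub('"$APP_ARGS"', 'X', script)

# Escaped: \$ matches a literal dollar sign in the pattern.
escaped = re.sub(r'"\$APP_ARGS"', 'simulators "$APP_ARGS"', script)

# Literal replacement: no regex involved, so nothing needs escaping.
literal = script.replace('"$APP_ARGS"', 'simulators "$APP_ARGS"')

print(unescaped)
print(escaped)
print(literal)
```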
I have a string like hello /world today/
I need to replace /world today/ with /MY NEW STRING/
Reading the manual I have found
newString = string.match("hello /world today/","%b//")
which I can use with gsub to replace, but I wondered whether there is also an elegant way to return just the text between the slashes. I know I could just trim it, but I wondered if there was a pattern.
Try something like one of the following:
slashed_text = string.match("hello /world today/", "/([^/]*)/")
slashed_text = string.match("hello /world today/", "/(.-)/")
slashed_text = string.match("hello /world today/", "/(.*)/")
This works because string.match returns any captures from the pattern, or the entire matched text if there are no captures. The key then is to make sure that the pattern has the right amount of greediness, remembering that Lua patterns are not a complete regular expression language.
The first two should match the same texts. In the first, I've expressly required that the pattern match as many non-slashes as possible. The second (thanks lhf) matches the shortest span of any characters at all followed by a slash. The third is greedier: it matches the longest span of characters that can still be followed by a slash.
The %b// in the original question doesn't have any advantages over /.-/ since the two delimiters are the same character.
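The difference in greediness is easiest to see on a string with more than one pair of slashes. The sketch below uses Python's re rather than Lua patterns, on the assumption that Lua's .- behaves like the lazy .*? here; the sample string is made up.

```python
import re

text = "hello /world today/ and /more/"

# As many non-slash characters as possible between two slashes.
negated = re.search(r"/([^/]*)/", text).group(1)
# Shortest span followed by a slash (the counterpart of Lua's ".-").
lazy = re.search(r"/(.*?)/", text).group(1)
# Longest span that can still be followed by a slash.
greedy = re.search(r"/(.*)/", text).group(1)

print(negated)  # world today
print(lazy)     # world today
print(greedy)   # world today/ and /more
```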
Edit: Added a pattern suggested by lhf, and more explanations.