Regular expression to match string that groups together with dot - python-3.x

Hi, I am trying to create a regular expression with these rules:
The portion before '.com' or '.edu' can have at most 10 letters.
If this portion does not exist, only 'com' (without the dot) should match.
For example,
'stack.com' is valid
'stackoverflow.com' is not valid as it has more than 10 letters before .com
'.com' is not valid while 'com' is valid
Here is what I have so far:
regex = r'^([A-Za-z]{,10}\.)?(com|edu)'
re.match(regex, 'com')
I am trying to group the portion before (com|edu) together, so that if it does not exist, then the . will also not be there.

Given the condition ".com is not valid while com is valid", I think your expression is the right one and you just have to do some processing afterwards:
import re
full_string = """stack.com
stackoverflow.com
.com
foo.edu
com
bar
sitcom
sit.com"""
regex = r'^([A-Za-z]{,10}\.)?(com|edu)$'
for base, domain in re.findall(regex, full_string, re.MULTILINE):
    if base not in (".", ""):
        print(base.strip("."))
    else:  # nothing before com/edu
        print(domain)
Edit: if you want to completely exclude .com (and not change it to com) you can still go with:
regex = r'^(?:[A-Za-z]{1,10}\.)?(?:com|edu)$'
print(re.findall(regex, full_string, re.MULTILINE))
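As a quick check against the rules in the question, here is a minimal sketch that wraps the second pattern in a helper. The names DOMAIN_RE and is_valid_domain are hypothetical, and re.fullmatch is used here in place of the ^ and $ anchors, assuming each string is tested on its own rather than as part of a multi-line block.

```python
import re

# Optional "<1 to 10 letters>." prefix, followed by com or edu.
DOMAIN_RE = re.compile(r'(?:[A-Za-z]{1,10}\.)?(?:com|edu)')

def is_valid_domain(text):
    """Return True if text follows the rules stated in the question."""
    return DOMAIN_RE.fullmatch(text) is not None

print(is_valid_domain("stack.com"))          # True
print(is_valid_domain("stackoverflow.com"))  # False (13 letters before .com)
print(is_valid_domain(".com"))               # False
print(is_valid_domain("com"))                # True
```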

Related

Python regex - check if pattern contains capturing named group

How can I check whether regex pattern contains a named capturing group? I want to decide whether to use re.findall or re.finditer based on the form of the regex.
Use the following approach:
import re
pat = r'.*(?P<word>\w+\d+\b).+'  # sample pattern (raw string, so \b stays a word boundary)
has_named_group = bool(re.search(r'\(\?P<\w+>[^)]+\)', pat))
This can also be a function:
def has_named_group(pat):
    return bool(re.search(r'\(\?P<\w+>[^)]+\)', pat))
You can use Pattern.groupindex
A dictionary mapping any symbolic group names defined by (?P<id>) to
group numbers. The dictionary is empty if no symbolic groups were used
in the pattern.
For example
import re
pattern = re.compile('(?P<mygroup>.*)')
if pattern.groupindex:
    print("The pattern contains a named capturing group")
else:
    print("The pattern does not contain a named capturing group")
Output
The pattern contains a named capturing group
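To connect this back to the question, here is one possible sketch of choosing between re.finditer and re.findall based on groupindex. The helper name find_matches and the sample patterns are hypothetical, and returning groupdict() for named groups is just one reasonable choice.

```python
import re

def find_matches(pattern, text):
    """Use finditer when the pattern has named groups, findall otherwise."""
    compiled = re.compile(pattern)
    if compiled.groupindex:
        # Named groups exist: return one dict of name -> text per match.
        return [m.groupdict() for m in compiled.finditer(text)]
    return compiled.findall(text)

print(find_matches(r'(?P<word>[a-z]+)(?P<num>\d+)', 'ab12 cd34'))
# [{'word': 'ab', 'num': '12'}, {'word': 'cd', 'num': '34'}]
print(find_matches(r'[a-z]+\d+', 'ab12 cd34'))
# ['ab12', 'cd34']
```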

Extract value from event path - lambda function

The lambda I am working on gets triggered through API gateway.
I want to extract a specific value from the path in the URL.
Sample URL : {id}/contacts
or
{id-0}/{id}/contacts
In order to extract the path variable I am using event.pathParameters,
which gives me the value, but I need to extract only {id} from the path.
I am using the following code to split the path parameter and extract {id}, but this is not a feasible option:
arr = path.split("/");
id = arr[arr.length-2];
Are there better ways to extract {id}? The id will always be the last segment right before the API name (in this case, contacts).
This extracts the segment located between the last two occurrences of /, or the segment before the only / if there is just one:
([^\/]+)\/[^\/]+$
https://regex101.com/r/tZNhrk/1
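A short Python sketch of how this pattern behaves on the two sample paths from the question. The names SEGMENT_RE and segment_before_last are hypothetical, and the backslashes before / are dropped because Python does not require them.

```python
import re

# Capture the path segment that sits right before the final segment.
SEGMENT_RE = re.compile(r'([^/]+)/[^/]+$')

def segment_before_last(path):
    """Return the second-to-last path segment, or None if there is none."""
    match = SEGMENT_RE.search(path)
    return match.group(1) if match else None

print(segment_before_last('{id}/contacts'))         # {id}
print(segment_before_last('{id-0}/{id}/contacts'))  # {id}
print(segment_before_last('contacts'))              # None
```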
Try the following:
import re
path = '{id-0}/{id}/contacts'  # example
api_name = 'contacts'  # api name
m = re.search(r'[^/]+(?=/%s)' % api_name, path)
if m:
    id = m.group()
The regex [^/]+(?=/%s) matches a run of non-slash characters that is followed by a slash and the specified api_name. If the regex matches, m.group() is assigned to id.
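If the API name could ever contain regex metacharacters, interpolating it directly into the pattern may misbehave. The following is a sketch under that assumption: extract_id is a hypothetical helper, re.escape keeps the name literal, and the extra (?:/|$) in the lookahead is an addition that requires the API name to be a whole path segment.

```python
import re

def extract_id(path, api_name):
    """Return the path segment directly before /<api_name>, or None."""
    # re.escape keeps characters such as '.' or '+' in api_name literal.
    pattern = r'[^/]+(?=/%s(?:/|$))' % re.escape(api_name)
    match = re.search(pattern, path)
    return match.group() if match else None

print(extract_id('{id-0}/{id}/contacts', 'contacts'))  # {id}
print(extract_id('{id}/contacts', 'contacts'))         # {id}
print(extract_id('{id}/orders', 'contacts'))           # None
```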

Get number from string in Python

I have a string, and I need to get only the digits from it.
url = "www.mylocalurl.com/edit/1987"
Now from that string, I need to get 1987 only.
I have been trying this approach,
id = [int(i) for i in url.split() if i.isdigit()]
But I am getting only an empty list [].
You can use a regex to get the digits as a list:
import re
url = "www.mylocalurl.com/edit/1987"
digit = re.findall(r'\d+', url)
output:
['1987']
Replace all non-digits with blank (effectively "deleting" them):
import re
url = "www.mylocalurl.com/edit/1987"
num = re.sub(r'\D', '', url)
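One caveat with deleting every non-digit: if the URL contains digits anywhere else, they all end up joined together. A small sketch, using a hypothetical URL with a digit in the host name, alongside an alternative that only takes the trailing number:

```python
import re

url = "www.site2.com/edit/1987"  # hypothetical URL with a digit in the host

# Deleting non-digits joins every digit in the string together.
print(re.sub(r'\D', '', url))  # 21987

# Anchoring on the end of the string keeps only the trailing number.
match = re.search(r'\d+$', url)
print(match.group() if match else None)  # 1987
```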
You aren't getting anything because, by default, .split() splits a string on whitespace. Since the URL contains no spaces, the result is a single item that is not all digits, so isdigit() filters it out. What you can do instead is capture the digits with a regex. For example:
import re
url = "www.mylocalurl.com/edit/1987"
regex = r'(\d+)'
numbers = re.search(regex, url)
captured = numbers.groups()[0]
If you do not know what regular expressions are, the code says: using the pattern r'(\d+)', which captures a run of one or more digits, search through the url. captured then holds the first captured group, which is '1987'.
If you don't want to use this, you can keep your .split() approach but pass / as the separator, for example url.split('/').
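The split-based variant could look like the sketch below, which keeps the list comprehension from the question and only changes the separator. The second half assumes the number is always the last path segment.

```python
url = "www.mylocalurl.com/edit/1987"

# Split on '/' and keep the segments made only of digits.
ids = [int(part) for part in url.split('/') if part.isdigit()]
print(ids)  # [1987]

# If the number is always the last segment, rsplit only splits once.
last = url.rsplit('/', 1)[-1]
print(int(last) if last.isdigit() else None)  # 1987
```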

How to get demangled function name using regex

I have a list of mangled function names like _Z6__comp7StudentS_ and _Z4SortiSt6vectorI7StudentSaIS0_EE. I read the wiki and found that the names follow a defined structure: _Z marks a mangled symbol, followed by a number and then a function name of that length.
So I wanted to retrieve the function name using a regex. The closest I have come is _Z(?:\d)(?<function_name>[a-z_A-Z]){\1}. But referring to \1 as a repetition count won't work because it is a string, right? Is there a single-regex solution to this?
You can use two capture groups: group 1 holds the length, and group 2 holds the letters that follow, which you then slice to that length:
import re
pattern = r"_Z(\d+)([a-z_A-Z]+)"
s = "_Z4SortiSt6vectorI7StudentSaIS0_EE"
m = re.search(pattern, s)
if m:
    print(m.group(2)[0: int(m.group(1))])
Output
Sort
Using _Z6__comp7StudentS_ will return __comp
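A variant of the same idea, sketched as a helper that matches only the _Z<length> prefix and then slices the original string, so that names containing digits are not cut short by the letter-only character class. The name function_name is hypothetical, and the sketch assumes the simple _Z<length><name> form described in the question.

```python
import re

def function_name(symbol):
    """Return the name encoded as _Z<length><name>, or None if the form differs."""
    m = re.match(r'_Z(\d+)', symbol)
    if m is None:
        return None
    length = int(m.group(1))
    start = m.end()  # index of the first character of the name
    name = symbol[start:start + length]
    # Reject symbols that are shorter than the declared length.
    return name if len(name) == length else None

print(function_name("_Z4SortiSt6vectorI7StudentSaIS0_EE"))  # Sort
print(function_name("_Z6__comp7StudentS_"))                 # __comp
print(function_name("main"))                                # None
```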

How to trim right and left side a url?

I have a list of websites which unfortunately look like "rs--google.com--plain". How can I remove 'rs--' and '--plain' from the url? I tried strip() but it didn't remove anything.
The way to remove "rs--" and "--plain" from that url (which is a string most likely) is to use some basic regex on it:
import re
url = 'rs--google.com--plain'
cleaned_url = re.search('rs--(.*)--plain', url).group(1)
print(cleaned_url)
Which prints out:
google.com
Here re.search looks for text between "rs--" and "--plain" and captures it as group 1; .group(1) then returns that text, which becomes the cleaned url:
cleaned_url = re.search('rs--(.*)--plain', url).group(1)
And now we only have "google.com" in our cleaned_url.
This assumes "rs--" and "--plain" are always in the url.
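When that assumption does not hold, re.search returns None and calling .group(1) raises an AttributeError. A minimal sketch of a guarded version follows; strip_markers is a hypothetical helper, and returning the input unchanged when the markers are missing is just one possible policy.

```python
import re

def strip_markers(url):
    """Return the text between 'rs--' and '--plain', or url unchanged if absent."""
    match = re.fullmatch(r'rs--(.*)--plain', url)
    return match.group(1) if match else url

print(strip_markers('rs--google.com--plain'))  # google.com
print(strip_markers('example.org'))            # example.org
```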
Updated to handle any letters on either side of --:
import re
url = 'po--google.com--plain'
cleaned_url = re.search('[A-Za-z]+--(.*)--[A-Za-z]+', url).group(1)
print(cleaned_url)
This handles any run of letters before the first -- and after the second --, however many letters there are, and returns only the url in the middle. It requires at least one letter on each side, so a string such as --myurl.com-- with nothing outside the dashes will not match.
A great resource for working on regex is regex101.
You can also use the str.replace method in Python:
>>> val = "rs--google.com--plain"
>>> newval =val.replace("rs--","").replace("--plain","")
>>> newval
'google.com'
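Note that replace removes the markers wherever they occur in the string, not only at the ends. On Python 3.9 or later, str.removeprefix and str.removesuffix are an alternative that touches only the start and end. The sketch below also shows why strip() appeared to do nothing: without an argument it only trims whitespace.

```python
val = "rs--google.com--plain"

# strip() with no argument only trims whitespace, so nothing changes here.
print(val.strip())  # rs--google.com--plain

# removeprefix/removesuffix (Python 3.9+) drop the markers only at the ends.
newval = val.removeprefix("rs--").removesuffix("--plain")
print(newval)  # google.com
```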
