How to trim the right and left sides of a URL? - python-3.x

I have a list of websites that unfortunately look like "rs--google.com--plain". How can I remove 'rs--' and '--plain' from the URL? I tried strip() but it didn't remove anything.

The way to remove "rs--" and "--plain" from that URL (which is most likely a string) is to use some basic regex on it:
import re
url = 'rs--google.com--plain'
cleaned_url = re.search('rs--(.*)--plain', url).group(1)
print(cleaned_url)
Which prints out:
google.com
This uses re's search function to look for anything between "rs--" and "--plain". If there is a match, the text in between is captured as group 1, which we retrieve with .group(1) and assign to cleaned_url:
cleaned_url = re.search('rs--(.*)--plain', url).group(1)
Now cleaned_url contains only "google.com".
This assumes "rs--" and "--plain" are always in the url.
Updated to handle any letters on either side of --:
import re
url = 'po--google.com--plain'
# [A-Za-z] rather than [A-z]: the latter also matches [ \ ] ^ _ and `
cleaned_url = re.search('[A-Za-z]+--(.*)--[A-Za-z]+', url).group(1)
print(cleaned_url)
This handles any run of letters before the first "--" and after the last "--", however many letters there are, and returns only the URL in the middle. In other words, it matches any string of the form letters--myurl.com--letters.
A great resource for working on regex is regex101
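Since re.search returns None when the pattern is not found, calling .group(1) directly raises AttributeError on any URL that lacks the markers. A minimal sketch of applying the pattern to a whole list while guarding against that case follows; the helper name strip_markers and the choice to return the input unchanged when nothing matches are assumptions, not part of the original answer.

```python
import re

# Hypothetical helper: the name and the "return unchanged" fallback are assumptions.
MARKERS = re.compile(r'^[A-Za-z]+--(.*)--[A-Za-z]+$')

def strip_markers(url):
    """Return the text between 'xx--' and '--yy', or the input unchanged."""
    match = MARKERS.search(url)
    return match.group(1) if match else url

urls = ['rs--google.com--plain', 'po--example.org--plain', 'no-markers.com']
cleaned = [strip_markers(u) for u in urls]
print(cleaned)  # ['google.com', 'example.org', 'no-markers.com']
```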

You can use the replace function in Python.
>>> val = "rs--google.com--plain"
>>> newval = val.replace("rs--", "").replace("--plain", "")
>>> newval
'google.com'
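One caveat: replace() removes the substring wherever it occurs, not only at the ends, and strip() takes a set of characters rather than a substring. A small sketch that trims only at the ends is shown below; the function name trim_affixes and its default arguments are hypothetical. On Python 3.9 and later, str.removeprefix and str.removesuffix do the same job.

```python
def trim_affixes(url, prefix='rs--', suffix='--plain'):
    """Remove prefix and suffix only when they sit at the ends of the string."""
    if url.startswith(prefix):
        url = url[len(prefix):]
    if suffix and url.endswith(suffix):
        url = url[:-len(suffix)]
    return url

print(trim_affixes('rs--google.com--plain'))        # google.com
print(trim_affixes('rs--cars--online.com--plain'))  # cars--online.com

# For comparison, chained replace() also removes the inner 'rs--':
print('rs--cars--online.com--plain'.replace('rs--', '').replace('--plain', ''))  # caonline.com
```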

Related

Get number from string in Python

I have a string and need to get only the digits from it.
url = "www.mylocalurl.com/edit/1987"
Now from that string, I need to get 1987 only.
I have been trying this approach,
id = [int(i) for i in url.split() if i.isdigit()]
But I am only getting an empty list, [].
You can use a regex to get the digits as a list.
import re
url = "www.mylocalurl.com/edit/1987"
digit = re.findall(r'\d+', url)
output:
['1987']
Replace all non-digits with blank (effectively "deleting" them):
import re
url = "www.mylocalurl.com/edit/1987"
num = re.sub(r'\D', '', url)  # raw string, so \D is not treated as a string escape
You aren't getting anything because, by default, the .split() method splits a string on whitespace. Since the URL you are splitting has no spaces, it comes back as a single piece, and that piece is not all digits. What you can do instead is use a regex capture. For example:
import re
url = "www.mylocalurl.com/edit/1987"
regex = r'(\d+)'
numbers = re.search(regex, url)
captured = numbers.groups()[0]
If you do not know what regular expressions are, the code is basically saying: using the regex r'(\d+)', which means "capture a run of digits", search through the url. captured then holds the first captured group, which is the string '1987'.
If you don't want to use regex, you can still use your .split() method, but this time pass / as the separator, for example url.split('/').
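The split-based alternative could look like the following sketch. It assumes the number is always the last path segment; the variable names last_segment and record_id are illustrative only.

```python
url = "www.mylocalurl.com/edit/1987"

# Take the last path segment and convert it only if it consists of digits.
last_segment = url.rstrip('/').split('/')[-1]
record_id = int(last_segment) if last_segment.isdigit() else None
print(record_id)  # 1987
```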

Regular expression to match string that groups together with dot

Hi, I am trying to create a regular expression with these rules:
the portion before '.com' or '.edu' can have at most 10 letters
if this portion does not exist, then it should only return 'com'
For example,
'stack.com' is valid
'stackoverflow.com' is not valid as it has more than 10 letters before .com
'.com' is not valid while 'com' is valid
Here is what I have so far:
regex = r'^([A-Za-z]{,10}\.)?(com|edu)'
re.match(regex, 'com')
I am trying to group the portion before (com|edu) together, so that if it does not exist, then the . will also not be there.
Given the condition ".com is not valid while com is valid", I think your expression is the right one and you just have to do some processing afterwards:
import re
full_string = """stack.com
stackoverflow.com
.com
foo.edu
com
bar
sitcom
sit.com"""
regex = r'^([A-Za-z]{,10}\.)?(com|edu)$'
for base, domain in re.findall(regex, full_string, re.MULTILINE):
    if base not in (".", ""):
        print(base.strip("."))
    else:  # nothing before com/edu
        print(domain)
Edit: if you want to completely exclude .com (and not change it to com) you can still go with:
regex = r'^(?:[A-Za-z]{1,10}\.)?(?:com|edu)$'
print(re.findall(regex, full_string, re.MULTILINE))
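The same rules can also be checked one string at a time. The sketch below is one possible reading of the requirements: it rejects '.com' outright and returns either the part before the dot or the bare domain. The function name extract and the use of None for invalid input are assumptions.

```python
import re

# The optional non-capturing group requires 1 to 10 letters followed by a dot.
PATTERN = re.compile(r'^(?:([A-Za-z]{1,10})\.)?(com|edu)$')

def extract(name):
    """Return the part before the dot, the bare domain, or None if invalid."""
    m = PATTERN.match(name)
    if m is None:
        return None
    base, domain = m.groups()
    return base if base else domain

samples = ['stack.com', 'stackoverflow.com', '.com', 'foo.edu',
           'com', 'bar', 'sitcom', 'sit.com']
results = {s: extract(s) for s in samples}
print(results)
```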

How to get the content after a string using regex in python

I have a string as follows:
A5697[2:10] = {ravi, rageev, raghav, smith};
I want the content after "A5697[2:10] =". So, my output should be:
{ravi, rageev, raghav, smith};
This is my code:
print(re.search(r'(?<=A\d+\[.*\] =\s).*', line).group())
But this gives the error:
sre_constants.error: look-behind requires fixed-width pattern
Can anyone help to solve this issue? I would prefer to use regex.
You can try re.sub, as below. Since you have given only one data point, I am assuming all the other data points follow a similar pattern.
import re
text = "A5697[2:10] = {ravi, rageev, raghav, smith}"
re.sub(r'(A\d+\[\d+:\d+\]\s+=\s+)(.+)', r'\2', text)
returns,
'{ravi, rageev, raghav, smith}'
re.sub replaces the entire match with the second capturing group, which captures everything after '= '.
Simply replace the bits you don't want:
print(re.sub(r'A\d[^=]*= *', '', line))
See demo here: https://rextester.com/NSG17655
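The original error arises because Python's re module only supports fixed-width look-behind. A capturing group avoids look-behind altogether; the following is a minimal sketch, assuming every line follows the same "A<digits>[...] = ..." shape as the single example given.

```python
import re

line = "A5697[2:10] = {ravi, rageev, raghav, smith};"

# Match the variable-width prefix normally and capture only what follows it.
match = re.search(r'A\d+\[[^\]]*\]\s*=\s*(.*)', line)
content = match.group(1) if match else None
print(content)  # {ravi, rageev, raghav, smith};
```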

How to pass a list variable into a URL string without "[ ] '" in Python

I need to format a URL string for use with urllib.
The URL string I want to get is:
http://localhost:8086/service/records/names?name=A,B,C,D,E,F,G
if I use
namelist = ['A','B','C','D','E','F','G']
url = 'http://localhost:8086/service/records/names?name={namelist}'.format(namelist=namelist)
then I get:
http://localhost:8086/service/records/names?name=['A', 'B', 'C', 'D', 'E', 'F', 'G']
So how should I format a URL string from a list of strings without the "[]'" characters?
Join the list into a string with
'[insert separator here]'.join(namelist)
so in your case
','.join(namelist)
This produces 'A,B,C,D,E,F,G'.
Then you can use your initial method with .format().
Your first option is what the other answers suggest: to create your comma-separated list yourself, like so:
import urllib.parse
query_string = 'name=' + ','.join(namelist)
url = 'http://localhost:8086/service/records/names?{query_string}'.format(query_string=query_string)
# url == 'http://localhost:8086/service/records/names?name=A,B,C,D,E,F,G'
This fits what you asked for, however it has some limitations: first, if one of your names has a comma, it will not be correctly escaped.
Second, the commas in your list, and other characters in namelist won't be properly encoded for the URL.
Your second option, a more robust version of the previous one, is to encode your list, like so:
import urllib.parse
query_params = {'name': ','.join(namelist)}
query_string = urllib.parse.urlencode(query_params)
url = 'http://localhost:8086/service/records/names?{query_string}'.format(query_string=query_string)
# url == 'http://localhost:8086/service/records/names?name=A%2CB%2CC%2CD%2CE%2CF%2CG'
This will properly escape the characters for URL usage, but you are still left with the manual assembling and parsing of the query string.
There is a third option, which I would suggest: use the standard way of passing a list in the query string, which is to repeat the key.
import urllib.parse
query_params = {'name': namelist}
query_string = urllib.parse.urlencode(query_params, doseq=True)
url = 'http://localhost:8086/service/records/names?{query_string}'.format(query_string=query_string)
# url == 'http://localhost:8086/service/records/names?name=A&name=B&name=C&name=D&name=E&name=F&name=G'
This last option is a bit more verbose but more robust, as the URL parser returns a list directly, so you don't need to split the value yourself.
Additionally, if there is a comma in one of your names, it will be automatically escaped.
Check out the difference between the three options:
>>> urllib.parse.parse_qs('name=A,B,C,D,E,F,G')
{'name': ['A,B,C,D,E,F,G']}
>>> urllib.parse.parse_qs('name=A%2CB%2CC%2CD%2CE%2CF%2CG')
{'name': ['A,B,C,D,E,F,G']}
>>> urllib.parse.parse_qs('name=A&name=B&name=C&name=D&name=E&name=F&name=G')
{'name': ['A', 'B', 'C', 'D', 'E', 'F', 'G']}
Last one will be easier to work with!
.format(namelist=",".join(namelist))
will work, using "," between list entries
In addition to the other answers: you can also use the nice f-string feature of Python 3.6+. It's a lot prettier than .format() imo, and reminds you of other programming languages that allow variable interpolation.
namelist = ['A','B','C','D','E','F','G']
url = f"http://localhost:8086/service/records/names?name={','.join(namelist)}"
print(url)
# http://localhost:8086/service/records/names?name=A,B,C,D,E,F,G

Alternative to .replace() for replacing multiple substrings in a string

Are there any alternatives that are similar to .replace() but that allow you to pass more than one old substring to be replaced?
I have a function to which I pass video titles so that specific characters can be removed (because the API I'm passing the videos to has bugs that don't allow certain characters):
def videoNameExists(vidName):
    vidName = vidName.encode("utf-8")
    bugFixVidName = vidName.replace(":", "")
    search_url ='https://api.brightcove.com/services/library?command=search_videos&video_fields=name&page_number=0&get_item_count=true&token=kwSt2FKpMowoIdoOAvKj&any=%22{}%22'.format(bugFixVidName)
Right now, it's eliminating ":" from any video titles with vidName.replace(":", "") but I would also like to remove "|" when it occurs in the name string stored in the vidName variable. Is there an alternative to .replace() that would allow me to replace more than one substring at a time?
>>> s = "a:b|c"
>>> s.translate(str.maketrans("", "", ":|"))  # Python 3; in Python 2 this was s.translate(None, ":|")
'abc'
You may use re.sub
import re
re.sub(r'[:|]', "", vidName)
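The character class above works for single characters. For removing several multi-character substrings in one pass, an alternation built with re.escape is one option. This is only a sketch; the function name remove_all and the sample input are hypothetical.

```python
import re

def remove_all(text, unwanted):
    """Delete every occurrence of each string in `unwanted` in a single pass."""
    # Escape each piece so characters like '|' are treated literally, and try
    # longer strings first so they win over shorter overlapping ones.
    pieces = sorted(unwanted, key=len, reverse=True)
    pattern = '|'.join(re.escape(s) for s in pieces)
    return re.sub(pattern, '', text)

title = remove_all("a:b|c--d", [":", "|", "--"])
print(title)  # abcd
```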
