Alternative to .replace() for replacing multiple substrings in a string

Are there any alternatives that are similar to .replace() but that allow you to pass more than one old substring to be replaced?
I have a function to which I pass video titles so that specific characters can be removed (because the API I'm passing the videos to has bugs that don't allow certain characters):
def videoNameExists(vidName):
    vidName = vidName.encode("utf-8")
    bugFixVidName = vidName.replace(":", "")
    search_url = 'https://api.brightcove.com/services/library?command=search_videos&video_fields=name&page_number=0&get_item_count=true&token=kwSt2FKpMowoIdoOAvKj&any=%22{}%22'.format(bugFixVidName)
Right now, it's eliminating ":" from any video titles with vidName.replace(":", ""), but I would also like to replace "|" when it occurs in the name string stored in the vidName variable. Is there an alternative to .replace() that would allow me to replace more than one substring at a time?

You can use str.translate to delete several characters at once. In Python 2:
>>> s = "a:b|c"
>>> s.translate(None, ":|")
'abc'
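The two-argument form of str.translate shown above is Python 2 only; in Python 3 you build a translation table with str.maketrans instead:
>>> s = "a:b|c"
>>> s.translate(str.maketrans("", "", ":|"))
'abc'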

You may use re.sub
import re
re.sub(r'[:|]', "", vidName)
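For example, with a sample string:
import re
print(re.sub(r'[:|]', "", "a:b|c"))  # abc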

Related

Get number from string in Python

I have a string and I have to get only the digits from that string.
url = "www.mylocalurl.com/edit/1987"
Now from that string, I need to get 1987 only.
I have been trying this approach,
id = [int(i) for i in url.split() if i.isdigit()]
But I am only getting an empty list [].
You can use a regex and get just the digits in a list.
import re
url = "www.mylocalurl.com/edit/1987"
digit = re.findall(r'\d+', url)
Output:
['1987']
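Note that re.findall returns every digit run in the string, so a url with several numbers (a made-up example) yields several items:
import re
print(re.findall(r'\d+', "www.mylocalurl.com/edit/1987/page/2"))  # ['1987', '2']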
Replace all non-digits with blank (effectively "deleting" them):
import re
num = re.sub(r'\D', '', url)
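Applied to the url from the question, this leaves only the digits:
import re
print(re.sub(r'\D', '', "www.mylocalurl.com/edit/1987"))  # 1987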
You aren't getting anything because by default the .split() method splits a string up where there are spaces. Since you are trying to split a hyperlink that has no spaces, nothing gets split up. What you can do instead is called a capture, using regex. For example:
import re
url = "www.mylocalurl.com/edit/1987"
regex = r'(\d+)'
numbers = re.search(regex, url)
captured = numbers.groups()[0]
If you do not know what regular expressions are, the code is basically saying: using the regex defined as r'(\d+)', which means capture any run of digits, search through the url. Then captured holds the first captured group, which is 1987.
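Printing the result confirms the capture:
print(captured)  # 1987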
If you don't want to use this, then you can use your .split() method, but this time provide a split using / as the separator, for example url.split('/').
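The digits are then the last element of the resulting list:
url = "www.mylocalurl.com/edit/1987"
print(url.split('/'))       # ['www.mylocalurl.com', 'edit', '1987']
print(url.split('/')[-1])   # 1987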

Combining several replace-statements into one in pandas

There is a DataFrame in pandas; basically it is a table scraped from Wikipedia's article: https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9Fst%C3%A4dte_in_Deutschland#Tabelle
For further processing, I am trying to clean up the data. So, these statements work well
df['Name'] = df['Name'].str.replace('\d+', '')
df['Name'] = df['Name'].str.strip()
df['Name'] = df['Name'].str.replace(',', '')
df['Name'] = df['Name'].str.replace('­-', '')
But how can I bring all these four statements into one? Probably using regular expressions.
I tried with df['Name'] = df['Name'].str.replace(r'[\d\-,]+', '') but it did not work. Maybe because of the word wrap character that was used.
My desired output is " Ber,li-n2 " -> "Berlin".
The tricky case is 'Mönchen­gladbach1, 5'.
Since you are just removing data, you may join the patterns you remove into a single pattern like the one you have; r'[\d,-]+' is a bit better stylistically.
You may remove any dash punctuation + soft hyphen (\u00AD) using [\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D], so you may want to add these codes to the regex.
Remember to assign the cleaned data back to the column and add .str.strip().
You may use
df['Name'] = df['Name'].str.replace(r'[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+', '').str.strip()
If you do not want to add str.strip(), add ^\s+ and \s+$ alternatives to the regex:
df['Name'] = df['Name'].str.replace(r'^\s+|[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+|\s+$', '')
Details
^\s+ - 1+ whitespaces at the start of the string
| - or
[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+ - 1 or more Unicode dashes (including the soft hyphen), digits, commas or - chars
| - or
\s+$ - 1+ whitespaces at the end of the string.
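As a minimal end-to-end sketch (with a toy DataFrame standing in for the scraped Wikipedia table, and only the soft hyphen \u00AD out of the long dash list); note that recent pandas versions need regex=True for pattern-based replacement:
import pandas as pd

df = pd.DataFrame({'Name': [' Ber,li-n2 ', 'M\u00f6nchen\u00adgladbach1, 5']})
# strip soft hyphens, digits, commas and hyphens, then trim surrounding whitespace
df['Name'] = df['Name'].str.replace(r'[\u00AD\d,-]+', '', regex=True).str.strip()
print(df['Name'].tolist())  # ['Berlin', 'Mönchengladbach']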
You can go with
df['Name'] = df['Name'].str.replace(r'(\d+|,|\u00AD|-)', '')
Put the items you want to remove into a group, and separate the different options using the pipe |.

How to trim the right and left side of a url?

I have a list of websites which unfortunately look like "rs--google.com--plain". How do I remove 'rs--' and '--plain' from the url? I tried strip() but it didn't remove anything.
The way to remove "rs--" and "--plain" from that url (which is a string most likely) is to use some basic regex on it:
import re
url = 'rs--google.com--plain'
cleaned_url = re.search('rs--(.*)--plain', url).group(1)
print(cleaned_url)
Which prints out:
google.com
What is done here is use the re module's search function to check if anything exists between "rs--" and "--plain"; if it does, it is matched to group 1. We then retrieve group 1 with .group(1) and set our entire "cleaned url" to it:
cleaned_url = re.search('rs--(.*)--plain', url).group(1)
And now we have only "google.com" in our cleaned_url.
This assumes "rs--" and "--plain" are always in the url.
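If a url might not contain both markers, re.search returns None and calling .group(1) on it raises an AttributeError; a small guard (a sketch, falling back to the original string) avoids that:
import re

url = 'not-a-marked-url'
match = re.search('rs--(.*)--plain', url)
cleaned_url = match.group(1) if match else url  # fall back to the original string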
Updated to handle any letters on either side of --:
import re
url = 'po--google.com--plain'
cleaned_url = re.search('[A-Za-z]+--(.*)--[A-Za-z]+', url).group(1)
print(cleaned_url)
This will handle anything that has letters before -- and after --, and capture only the url in the middle, regardless of how many letters are on either side. Note that [A-Za-z] is used rather than [A-z], since [A-z] also matches the punctuation characters between Z and a in ASCII.
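For example, with two made-up prefixes:
import re

for url in ['rs--google.com--plain', 'po--google.com--plain']:
    print(re.search('[A-Za-z]+--(.*)--[A-Za-z]+', url).group(1))
# google.com
# google.com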
A great resource for working on regex is regex101
You can use the replace function in Python.
>>> val = "rs--google.com--plain"
>>> newval = val.replace("rs--", "").replace("--plain", "")
>>> newval
'google.com'

How to remove/delete characters from end of string that match another end of string

I have thousands of strings (not in English) that are in this format:
['MyWordMyWordSuffix', 'SameVocabularyItemMyWordSuffix']
I want to return the following:
['MyWordMyWordSuffix', 'SameVocabularyItem']
Because strings are immutable and I want to start the matching from the end, I keep confusing myself about how to approach it.
My best guess is some kind of loop that starts from the end of the strings and keeps checking for a match.
However, since I have so many of these to process it seems like there should be a built in way faster than looping through all the characters, but as I'm still learning Python I don't know of one (yet).
The nearest example I could find on SO is here, but it isn't really what I'm looking for.
Thank you for helping me!
You can use commonprefix from os.path to find the common suffix between them:
from os.path import commonprefix

def getCommonSuffix(words):
    # get common suffix by reversing both words and finding the common prefix
    prefix = commonprefix([word[::-1] for word in words])
    return prefix[::-1]
which you can then use to slice out the suffix from the second string of the list:
word_list = ['MyWordMyWordSuffix', 'SameVocabularyItemMyWordSuffix']
suffix = getCommonSuffix(word_list)
if suffix:
    print("Found common suffix:", suffix)
    # filter out suffix from second word in the list
    word_list[1] = word_list[1][0:-len(suffix)]
    print("Filtered word list:", word_list)
else:
    print("No common suffix found")
Output:
Found common suffix: MyWordSuffix
Filtered word list: ['MyWordMyWordSuffix', 'SameVocabularyItem']
Demo: https://repl.it/#glhr/55705902-common-suffix

Searching for strings in a 'dictionary' file with multiple wildcard values

I am trying to create a function which takes 2 parameters: a word with wildcards in it, like "*arn*val", and the name of a file containing a dictionary. It returns a list of all words that match the wildcard word, like ["carnival"].
My code works fine for anything with only one "*" in it, however any more and I'm stumped as to how to do it.
Just searching for the wildcard string in the file was returning nothing.
Here is my code:
dictionary_file = open(dictionary_filename, 'r')
dictionary = dictionary_file.read()
dictionary_file.close()
dictionary = dictionary.split()
alphabet = ["a","b","c","d","e","f","g","h","i",
            "j","k","l","m","n","o","p","q","r",
            "s","t","u","v","w","x","y","z"]
new_list = []
for letter in alphabet:
    if wildcard.replace("*", letter) in dictionary:
        new_list += [wildcard.replace("*", letter)]
return new_list
The parameters: the first is the wildcard string (wildcard), and the second is the dictionary file name (dictionary_filename).
Most answers on this site were about Regex, which I have no knowledge of.
Your particular error is that .replace replaces all occurrences, e.g., "*arn*val" -> "CarnCval" or "IarnIval", but you want different letters here. You could fix it with a second nested loop over the alphabet (or use itertools.product() to generate all possible letter pairs), but a simpler way is to use regular expressions:
import re
# each `*` corresponds to an ascii lowercase letter
pattern = re.escape(wildcard).replace("\\*", "[a-z]")
matches = list(filter(re.compile(pattern+"$").match, known_words))
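A quick demo, assuming known_words holds the words read from the dictionary file:
import re

known_words = ["carnival", "carnal", "festival"]  # stand-in for the dictionary file's words
wildcard = "*arn*val"
pattern = re.escape(wildcard).replace("\\*", "[a-z]")  # -> "[a-z]arn[a-z]val"
print(list(filter(re.compile(pattern + "$").match, known_words)))  # ['carnival']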
Note: it doesn't support escaping * in the wildcard.
If the input wildcards are file patterns, then you could use the fnmatch module to filter words:
import fnmatch
matches = fnmatch.filter(known_words, wildcard)
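Note that fnmatch's * matches any number of characters (including none) rather than exactly one, so it is more permissive than the regex above. A quick demo with the same assumed word list:
import fnmatch

known_words = ["carnival", "carnal", "festival"]  # same stand-in word list
print(fnmatch.filter(known_words, "*arn*val"))  # ['carnival']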
