Combining several replace statements into one in pandas - python-3.x

There is a DataFrame in pandas, see image below
Basically it is a table scraped from Wikipedia's article: https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9Fst%C3%A4dte_in_Deutschland#Tabelle
For further processing, I am trying to clean up the data. So, these statements work well
df['Name'] = df['Name'].str.replace('\d+', '')
df['Name'] = df['Name'].str.strip()
df['Name'] = df['Name'].str.replace(',', '')
df['Name'] = df['Name'].str.replace('­-', '')
But how can I bring all these four statements into one? Probably using regular expressions.
I tried df['Name'] = df['Name'].str.replace(r'[\d\-,]+', ''), but it did not work, probably because of the soft-hyphen (word-wrap) character in the data.
My desired output is " Ber,li-n2 " -> "Berlin".
The tricky case is a value like 'Mönchen­gladbach1, 5', which contains a soft hyphen (U+00AD) between 'Mönchen' and 'gladbach'.

You are removing the data, so you may join the patterns you remove into a single pattern like the one you have; r'[\d,-]+' is a bit better stylistically (the hyphen needs no escaping at the end of a character class).
You may remove any dash punctuation + soft hyphen (\u00AD) using [\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D], so you may want to add these codes to the regex.
Remember to assign the cleaned data back to the column and add .str.strip().
You may use
df['Name'] = df['Name'].str.replace(r'[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+', '', regex=True).str.strip()
If you do not want to add str.strip(), add ^\s+ and \s+$ alternatives to the regex:
df['Name'] = df['Name'].str.replace(r'^\s+|[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+|\s+$', '', regex=True)
Details
^\s+ - 1+ whitespaces at the start of the string
| - or
[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+ - 1 or more soft hyphens, Unicode dashes, digits, commas or - chars
| - or
\s+$ - 1+ whitespaces at the end of the string.

You can go with
df['Name'] = df['Name'].str.replace(r'(\d+|,|\u00ad|-)', '', regex=True)
Put the items you want to remove into a group and separate the alternatives with the pipe |.
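Putting the accepted approach together, a minimal runnable sketch (the sample values are hypothetical, mimicking the scraped table; regex=True makes the intent explicit in newer pandas versions):

```python
import pandas as pd

# hypothetical sample values mimicking the scraped Wikipedia table
df = pd.DataFrame({'Name': [' Ber,li-n2 ', 'M\u00f6nchen\u00adgladbach1, 5']})

# one pass: remove digits, commas, hyphens and the soft hyphen, then trim whitespace
df['Name'] = df['Name'].str.replace(r'[\u00AD\d,-]+', '', regex=True).str.strip()
print(df['Name'].tolist())  # ['Berlin', 'Mönchengladbach']
```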

How to remove leading spaces from strings in a dataseries/list?

I am doing a network analysis via networkx and noticed that some of the nodes are being treated as different nodes just because they have extra (leading) spaces.
I tried to remove the spaces using the following codes but I cannot seem to make the output become strings again.
rhedge = pd.read_csv(r"final.edge.csv")
rhedge
source | to
------ | ----------
niala  | Sana, Sana
Wacko  | Ana, Aisa
rhedge['to'][1]
'Sana, Sana'
rhedge['splitted_users2'] = rhedge['to'].apply(lambda x:x.split(','))
#I need to split them so they will be included as different nodes
The problem is with the next code
rhedge['splitted_users2'][1]
['Sana', ' Sana']
As you can see the second Sana has a leading space.
I tried to do this:
split_users = []
for i in split:
    row = [x.strip() for x in i]
    split_users.append(row)
pd.Series(split_users)
But when I then try to split them by "," again, it won't work because the column now contains lists instead of strings. I believe that stripping the spaces would make networkx treat them as one node, as opposed to creating a different node for the name with a leading space.
THANK YOU
Changing the lambda expression
import pandas as pd
# dataframe creation
df = pd.DataFrame({'source': ['niala', 'Wacko'], 'to': ['Sana, Sana', 'Ana, Aisa']})
# split and strip with a list comprehension
df['splitted_users2'] = df['to'].apply(lambda x:[y.strip() for y in x.split(',')])
print(df['splitted_users2'][0])
>>> ['Sana', 'Sana']
Alternatively
Option 1
Split on ', ' instead of ','
df['to'] = df['to'].str.split(', ')
Option 2
Replace ' ' with '' and then split on ','
This has the benefit of removing any whitespace around either name (e.g. [' Sana, Sana', ' Ana, Aisa'])
df['to'] = df['to'].str.replace(' ', '').str.split(',')
If you want the names split into separate columns, see SO: Pandas split column of lists into multiple columns
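If the names should end up in separate columns rather than a list column, a sketch using expand=True (the to_1/to_2 column names are made up here):

```python
import pandas as pd

# hypothetical frame matching the question's data
df = pd.DataFrame({'source': ['niala', 'Wacko'], 'to': ['Sana, Sana', 'Ana, Aisa']})

# expand=True yields one column per element instead of a single list column
split_cols = df['to'].str.split(', ', expand=True)
df[['to_1', 'to_2']] = split_cols  # to_1/to_2 are made-up column names
print(df[['to_1', 'to_2']].values.tolist())  # [['Sana', 'Sana'], ['Ana', 'Aisa']]
```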

How to find special characters from Python Data frame

I need to find special characters in an entire dataframe.
In the data frame below, some columns contain special characters. How can I find which columns contain them?
I want to display the text for each column if it contains special characters.
You can setup an alphabet of valid characters, for example
import string
alphabet = string.ascii_letters + string.punctuation
Which is
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
And just use
df.col.str.strip(alphabet).astype(bool).any()
For example,
df = pd.DataFrame({'col1':['abc', 'hello?'], 'col2': ['ÃÉG', 'Ç']})
col1 col2
0 abc ÃÉG
1 hello? Ç
Then, with the above alphabet,
df.col1.str.strip(alphabet).astype(bool).any()
False
df.col2.str.strip(alphabet).astype(bool).any()
True
The statement special characters can be very tricky, because it depends on your interpretation. For example, you might or might not consider # to be a special character. Also, some languages (such as Portuguese) may have chars like ã and é but others (such as English) will not.
To remove unwanted characters from dataframe columns, use regex:
import re

def strip_character(dataCol):
    # keep only letters and a whitelist of punctuation; everything else is removed
    r = re.compile(r'[^a-zA-Z !@#$%&*_+\-=|:";<>,./()\[\]{}\']')
    return r.sub('', dataCol)

df[resultCol] = df[dataCol].apply(strip_character)
# Whitespace could also be worth including in some cases.
import string
unwanted = string.ascii_letters + string.punctuation + string.whitespace
print(unwanted)
# This helped me extract '10' from '10+ years'.
df.col = df.col.str.strip(unwanted)
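A small runnable sketch of the strip(unwanted) trick (the sample series is hypothetical):

```python
import string
import pandas as pd

unwanted = string.ascii_letters + string.punctuation + string.whitespace
s = pd.Series(['10+ years', '5 years'])  # hypothetical sample
# strip() removes characters from both ends only, so the leading digits survive
print(s.str.strip(unwanted).tolist())  # ['10', '5']
```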

Python 3: Removing u200b (zwsp) and newlines (\n) and spaces - chaining List operations?

I'm really stumped as to why this doesn't work. All I want to do is remove the zwsp (u200b), newlines and extra spaces from content read from a file.
Ultimately, I want to write this out to a new file, which I have functional, just not in the desired format yet.
My input (a short test file, which has zwsp / u200b in it) consists of the following:
Australia 1975
​Adelaide ​ 2006 ​ 23,500
Brisbane (Logan) 2006 29,700
​Brisbane II (North Lakes) ​ 2016 ​ 29,000
Austria 1977
Graz 1989 26,100
Innsbruck 2000 16,000
Klagenfurt 2008 27,000
My code so is as follows:
input_file = open('/home/me/python/info.txt', 'r')
file_content = input_file.read()
input_file.close()
output_nospace = file_content.replace('\u200b' or '\n' or ' ', '')
print(output_nospace)
f = open('nospace_u200b.txt', 'w')
f.write(output_nospace)
f.close()
However, this doesn't work as I expect.
Whilst it removes u200b, it does not remove newlines or spaces. I have to test for absence of u200b by checking the output file produced as part of my script.
If I remove one of the arguments, e.g. '\u200b', like so:
output_nospace = file_content.replace('\n' or ' ', '')
...then sure enough the resulting file is without newlines or spaces, but u200b remains as expected. Revert back to the original described at the top of this post, and it doesn't remove u200b, newlines and spaces.
Can anyone advise what I'm doing wrong here? Can you chain list operations like this? How can I get this to work?
Thanks.
The result of code like "a or b or c" is just the first thing of a, b, or c that isn't considered false by Python (None, 0, "", [], and False are some false values). In this case the result is the first value, the zwsp character. It doesn't convey to the replace function that you're looking to replace a or b or c with ''; the replace code isn't informed you used 'or' at all. You can chain replacements like this, though: s.replace('a', '').replace('b', '').replace('c', ''). (Also, replace is a string operation, not a list operation, here.)
Based on this question, I'd suggest a tutorial like learnpython.org. Statements in Python or other programming languages are different from human-language sentences in ways that can confuse you when you're just starting out.
As indicated by @twotwotwo, the following .replace chain solves the issue.
output_nospace = \
file_content.replace('\u200b', '').replace('\n', '').replace(' ', '')
Thanks so much for pointing me in the right direction. :)
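For reference, the same cleanup can also be done in one pass with a regex character class instead of chained .replace() calls; a sketch with an inline sample string standing in for the file content:

```python
import re

# inline sample standing in for file_content read from the file
file_content = '\u200bAdelaide \u200b 2006 \u200b 23,500\n'
# the character class matches zwsp, newline or space in a single substitution
output_nospace = re.sub(r'[\u200b\n ]', '', file_content)
print(output_nospace)  # Adelaide200623,500
```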

str.format places last variable first in print

The purpose of this script is to parse a text file (sys.argv[1]), extract certain strings, and print them in columns. I start by printing the header. Then I open the file, and scan through it, line by line. I make sure that the line has a specific start or contains a specific string, then I use regex to extract the specific value.
The matching and extraction work fine.
My final print statement doesn't work properly.
import re
import sys

print("{}\t{}\t{}\t{}\t{}".format("#query", "target", "e-value",
                                  "identity(%)", "score"))
with open(sys.argv[1], 'r') as blastR:
    for line in blastR:
        if line.startswith("Query="):
            queryIDMatch = re.match('Query= (([^ ])+)', line)
            queryID = queryIDMatch.group(1)
            queryID.rstrip
        if line[0] == '>':
            targetMatch = re.match('> (([^ ])+)', line)
            target = targetMatch.group(1)
            target.rstrip
        if "Score = " in line:
            eValue = re.search(r'Expect = (([^ ])+)', line)
            trueEvalue = eValue.group(1)
            trueEvalue = trueEvalue[:-1]
            trueEvalue.rstrip()
            print('{0}\t{1}\t{2}'.format(queryID, target, trueEvalue), end='')
The problem occurs when I try to print the columns. When I print the first 2 columns, it works as expected (except that it's still printing new lines):
#query target e-value identity(%) score
YAL002W Paxin1_129011
YAL003W Paxin1_167503
YAL005C Paxin1_162475
YAL005C Paxin1_167442
The 3rd column is a number in scientific notation like 2e-34
But when I add the 3rd column, eValue, it breaks down:
#query target e-value identity(%) score
YAL002W Paxin1_129011
4e-43YAL003W Paxin1_167503
1e-55YAL005C Paxin1_162475
0.0YAL005C Paxin1_167442
0.0YAL005C Paxin1_73182
I have removed all new lines, as far I know, using the rstrip() method.
At least three problems:
1) queryID.rstrip and target.rstrip are missing the calling parentheses ().
2) Something like trueEvalue.rstrip() doesn't mutate the string; you would need
trueEvalue = trueEvalue.rstrip()
if you want to keep the change.
3) This might be a problem, but without seeing your data I can't be 100% sure. The r in rstrip stands for "right". If trueEvalue is 4e-43\n, then trueEvalue.rstrip() would indeed be free of newlines. But your values seem to be something like \n4e-43. If you simply use .strip(), then newlines will be removed from either side.
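Point 2 can be seen in a couple of lines (a sketch with a made-up value):

```python
trueEvalue = '4e-43\n'
trueEvalue.rstrip()               # returns a new string; trueEvalue is unchanged
assert trueEvalue == '4e-43\n'
trueEvalue = trueEvalue.rstrip()  # reassign to keep the stripped value
assert trueEvalue == '4e-43'
```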

Alternative to .replace() for replacing multiple substrings in a string

Are there any alternatives that are similar to .replace() but that allow you to pass more than one old substring to be replaced?
I have a function with which I pass video titles so that specific characters can be removed (because the API I'm passing the videos too has bugs that don't allow certain characters):
def videoNameExists(vidName):
    vidName = vidName.encode("utf-8")
    bugFixVidName = vidName.replace(":", "")
    search_url = 'https://api.brightcove.com/services/library?command=search_videos&video_fields=name&page_number=0&get_item_count=true&token=kwSt2FKpMowoIdoOAvKj&any=%22{}%22'.format(bugFixVidName)
Right now, it's eliminating ":" from any video titles with vidName.replace(":", "") but I also would like to replace "|" when that occurs in the name string sorted in the vidName variable. Is there an alternative to .replace() that would allow me to replace more than one substring at a time?
>>> s = "a:b|c"
>>> s.translate(str.maketrans('', '', ':|'))
'abc'
(The two-argument form s.translate(None, ":|") only works in Python 2; in Python 3, build the translation table with str.maketrans.)
You may use re.sub
import re
re.sub(r'[:|]', "", vidName)
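A runnable sketch of the re.sub approach with a made-up title:

```python
import re

vidName = 'Some: Title | Part 1'  # hypothetical title containing both bad characters
# the character class removes every ':' and '|' in one call
bugFixVidName = re.sub(r'[:|]', '', vidName)
print(bugFixVidName)  # Some Title  Part 1
```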
