Replacing a substring AFTER a character in a Python pandas DataFrame

I'm new to pandas and am having a lot of trouble with this and haven't found a solution, despite my searches. Hoping one of you can help me.
I have a pandas dataframe that has a column of emails that I'm trying to clean up. Some examples are:
>>> email['EMAIL']
0 testing#...com
1 NaN
2 I.am.ME#GAMIL.COM
3 FIRST.LAST.NAME#MAIL.CMO
4 EMAIL+REMOVE#TESTING.COM
Name: EMAIL, dtype: object
There are a number of things I'm trying to do here:
1) replace misspelled endings (e.g. CMO) with correct spellings (e.g. COM)
2) replace misspelled domain names with correct spellings
3) replace multiple periods with just 1 period AFTER the '#' symbol.
4) remove all periods before the '#' sign if they have a gmail account
5) remove all characters after the "+" symbol up to the '#' symbol
So, from the example above I would have returned:
>>> email['EMAIL']
0 testing#.com
1 NaN
2 IamME#GMAIL.COM
3 FIRST.LAST.NAME#MAIL.COM
4 EMAIL#TESTING.COM
Name: EMAIL, dtype: object
I've worked on a number of different codes and keep running into errors. Here's one of my best guesses so far, for removing multiple periods after the '#' symbol.
def remove_periods(email):
    email_split = email['EMAIL'].str.split('#')
    ending = email_split.str.get(-1)
    ending = ending.str.replace('\.{2,}', '.')
    emailupdate = email_split.str[:-1]
    emailupdate.append(ending)
    email_split.str.get()
    return '#'.join(emailupdate)
email['EMAIL'].apply(remove_periods)
I could post my other attempts too, but they all return errors as well.
Thanks a lot for the help!

import numpy as np
import pandas as pd
pd.options.display.width = 1000
email = pd.DataFrame({'EMAIL':[
'testing#...com', np.nan, 'I.am.ME#GAMIL.COM', 'FIRST.LAST.NAME#MAIL.CMO',
'EMAIL+REMOVE#TESTING.COM', 'gamil#bar...com', 'noperiods#localhost']})
email[['NAME', '#', 'ADDR']] = email['EMAIL'].str.rpartition('#')
# regex=True is required on pandas >= 2.0, where str.replace matches literally by default
# 1) replace misspelled endings (e.g. CMO) with correct spellings
email['ADDR'] = email['ADDR'].str.replace(r'(?i)CMO$', 'COM', regex=True)
# 2) replace misspelled domain names with correct spellings
email['ADDR'] = email['ADDR'].str.replace(r'(?i)GAMIL', 'GMAIL', regex=True)
# 3) replace multiple periods with just 1 period AFTER the '#' symbol.
email['ADDR'] = email['ADDR'].str.replace(r'[.]{2,}', '.', regex=True)
# 4) remove all periods before the '#' sign if they have a gmail account
mask = email['ADDR'].str.contains(r'(?i)^GMAIL[.]COM$') == True
email.loc[mask, 'NAME'] = email.loc[mask, 'NAME'].str.replace(r'[.]', '', regex=True)
# 5) remove all characters after the "+" symbol up to the '#' symbol
email['NAME'] = email['NAME'].str.replace(r'[+].*', '', regex=True)
# put it back together. You could reassign to email['EMAIL'] if you wish.
email['NEW_EMAIL'] = email['NAME'] + email['#'] + email['ADDR']
# clean up intermediate columns
# email = email.drop(columns=['NAME', '#', 'ADDR'])
print(email)
yields
                      EMAIL             NAME     #         ADDR                 NEW_EMAIL
0            testing#...com          testing     #         .com              testing#.com
1                       NaN              NaN  None         None                       NaN
2         I.am.ME#GAMIL.COM            IamME     #    GMAIL.COM           IamME#GMAIL.COM
3  FIRST.LAST.NAME#MAIL.CMO  FIRST.LAST.NAME     #     MAIL.COM  FIRST.LAST.NAME#MAIL.COM
4  EMAIL+REMOVE#TESTING.COM            EMAIL     #  TESTING.COM         EMAIL#TESTING.COM
5           gamil#bar...com            gamil     #      bar.com             gamil#bar.com
6       noperiods#localhost        noperiods     #    localhost       noperiods#localhost
The NAME column holds everything before the last #
The ADDR column holds everything after the last #.
I left the NAME, ADDR columns visible (and did not overwrite the original EMAIL column)
so it would be easier to understand the intermediate steps.
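For comparison, the same five steps can be sketched as a plain-Python function that cleans one address at a time. The name `clean_email` and its `sep` parameter are hypothetical, and the function is only an illustration of the steps above, not a replacement for the vectorized version. It assumes the separator is '#', as in the sample data.

```python
import re

def clean_email(value, sep='#'):
    """Apply the five clean-up steps to one address (hypothetical helper)."""
    if not isinstance(value, str) or sep not in value:
        return value  # NaN, None or text without a separator is left untouched
    name, _, addr = value.rpartition(sep)         # split on the LAST separator
    addr = re.sub(r'(?i)CMO$', 'COM', addr)       # 1) fix misspelled ending
    addr = re.sub(r'(?i)GAMIL', 'GMAIL', addr)    # 2) fix misspelled domain
    addr = re.sub(r'[.]{2,}', '.', addr)          # 3) collapse repeated periods
    if re.fullmatch(r'(?i)GMAIL[.]COM', addr):    # 4) gmail only: drop periods in the name
        name = name.replace('.', '')
    name = re.sub(r'[+].*', '', name)             # 5) drop everything from '+' onward
    return name + sep + addr

print(clean_email('I.am.ME#GAMIL.COM'))         # IamME#GMAIL.COM
print(clean_email('EMAIL+REMOVE#TESTING.COM'))  # EMAIL#TESTING.COM
```

With a DataFrame like the one above, this could presumably be applied as `email['EMAIL'].map(clean_email)`, at the cost of being slower than the `.str` methods on large columns.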

Related

Is there Python code that can access a cell in Excel and change its letters to their opposite form?

I'm new to Python and I need to write a program that changes the letters in a cell to their opposite form. It also needs to work out how many names are in the column and which row the name list starts at, so that it can change all of the names. The goal is to change the names without ever looking at the name list, for privacy reasons. I'm currently using PyCharm and openpyxl, in case anyone is wondering. The picture shows how it should look before and after. I have made a few attempts, but I can't come up with a way to change the letters. I also tried a replacement dictionary (replacement = {'Danial' = 'Wzmrzo'}), but that requires me to look at the name list before I can change the letters.
import openpyxl
from openpyxl import Workbook, load_workbook
from openpyxl.utils import get_column_letter
print("Type the file name:")
DF = input()
wb = load_workbook(DF + '.xlsx')
print("Sheet Name:")
sht = input()
ws = wb[sht]
NC = str(input("Where is the Name Column?"))
column = ws[ NC ]
column_list = [column[x].value for x in range(len(column))]
print(column_list)
wb.save(DF + '.xlsx')
[Image: the name column before and after the change]
Warning: I'm not too familiar with openpyxl and how it accesses rows and columns, and it seems to have changed a lot in the last few years. This should give you an idea of how to make it work, but it might not work exactly as written depending on your version.
To find the name column you could use
name_col = False
# loop along the top row looking for "Name"
for i, x in enumerate(ws.iter_cols(max_row=1)):
    if x[0].value == "Name":
        name_col = i + 1  # enumerate is 0 indexed, excel rows/cols are 1 indexed
        break
if name_col:
    pass  # insert name changing code here
else:
    print("'Name' column not found.")
To change the names you could use (insert this in the code above)
# loop down name column
for i, x in enumerate(ws.iter_rows(min_col=name_col, max_col=name_col)):
    # we need to skip the header row so
    if i == 0:
        continue
    name = x[0].value
    new_name = ""
    for c in name:
        # ord() gets the ASCII value of the char, manipulates it to be the opposite then uses chr() to get back the character
        if 'a' <= c <= 'z':
            new_c = chr(25 - (ord(c) - 97) + 97)
        elif 'A' <= c <= 'Z':
            new_c = chr(25 - (ord(c) - 65) + 65)
        else:
            new_c = c  # leave spaces, hyphens and other non-letters unchanged
        new_name += new_c  # strings have no append(), so build the result by concatenation
    ws.cell(row=i+1, column=name_col).value = new_name  # enumerate is 0 indexed, excel rows/cols are 1 indexed hence i+1
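The letter swap itself can also be sketched without `ord()`/`chr()` arithmetic, using a translation table built with `str.maketrans`. The names `MIRROR` and `mirror_name` are made up for this illustration, and the sketch assumes only ASCII letters need to be swapped.

```python
import string

# Map a->z, b->y, ... and A->Z, B->Y, ...; characters not in the table stay as they are.
MIRROR = str.maketrans(
    string.ascii_lowercase + string.ascii_uppercase,
    string.ascii_lowercase[::-1] + string.ascii_uppercase[::-1],
)

def mirror_name(name):
    """Return the name with every ASCII letter replaced by its opposite."""
    return name.translate(MIRROR)

print(mirror_name('Danial'))  # Wzmrzo
```

Because the mapping is its own inverse, calling `mirror_name` on an already changed name gives back the original, so the same function could be used to restore the list later.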

Python and Regex: Problem with re findall()

This is a project found at https://automatetheboringstuff.com/2e/chapter7/
It searches text on the clipboard for phone numbers and emails, then copies the results to the clipboard again.
If I understood it correctly, when the regular expression contains groups, the findall() function returns a list of tuples. Each tuple would contain strings matching each regex group.
Now this is my problem: as far as I can tell, the regex in phoneRegex contains only 6 groups (numbered in the code), so I would expect tuples of length 6.
But when I print the tuples I get tuples of length 9:
('800-420-7240', '800', '-', '420', '-', '7240', '', '', '')
('415-863-9900', '415', '-', '863', '-', '9900', '', '', '')
('415-863-9950', '415', '-', '863', '-', '9950', '', '', '')
What am I missing?
#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.
import pyperclip, re
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))? # area code (first group?)0
(\s|-|\.)? # separator 1
(\d{3}) # first 3 digits 2
(\s|-|\.) # separator 3
(\d{4}) # last 4 digits 4
(\s*(ext|x|ext.)\s*(\d{2,5}))? # extension 5
)''', re.VERBOSE)
# Create email regex.
emailRegex = re.compile(r'''(
[a-zA-Z0-9._%+-]+ # username
@ # @ symbol
[a-zA-Z0-9.-]+ # domain name
(\.[a-zA-Z]{2,4}) # dot-something
)''', re.VERBOSE)
text = str(pyperclip.paste())
matches = []
for groups in phoneRegex.findall(text):
    print(groups)
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])
# Copy results to the clipboard.
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')
Anything in parentheses will become a capturing group (and add one to the length of the re.findall tuple) unless you specify otherwise. To turn a sub-group into a non-capturing group, add ?: just inside the parentheses:
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?
(\s|-|\.)?
(\d{3})
(\s|-|\.)
(\d{4})
(\s*(?:ext|x|ext.)\s*(?:\d{2,5}))? # <---
)''', re.VERBOSE)
You can see the extension part was adding two additional capturing groups. With this updated version, you will have 7 items in your tuple. There are 7 instead of 6 because the outermost parentheses, which wrap the entire pattern, form a capturing group as well.
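The group counts can be checked directly with the `groups` attribute of a compiled pattern. The following sketch compiles both versions of the pattern and compares them; the variable names and the sample text are made up for the illustration.

```python
import re

# Pattern from the question: every pair of parentheses captures.
capturing = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext.)\s*(\d{2,5}))?
    )''', re.VERBOSE)

# Same pattern with the two nested extension groups made non-capturing.
non_capturing = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(?:ext|x|ext.)\s*(?:\d{2,5}))?
    )''', re.VERBOSE)

sample = 'Call 800-420-7240 or 415-863-9900 x123.'

print(capturing.groups)      # 9
print(non_capturing.groups)  # 7
print(non_capturing.findall(sample)[1])
```

Note that once the digits of the extension are no longer captured on their own, code that reads `groups[8]` (as the script in the question does) would need to use the whole extension in `groups[6]` instead.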
The regex could be better, too. This is cleaner and will match more cases with the re.IGNORECASE flag:
phoneRegex = re.compile(r'''(
(\(?\d{3}\)?)
([\s.-])?
(\d{3})
([\s.-])
(\d{4})
\s* # don't need to capture whitespace
((?:ext\.?|x)\s*(?:\d{1,5}))?
)''', re.VERBOSE | re.IGNORECASE)

How to remove leading spaces from strings in a dataseries/list?

I am doing a network analysis with networkx and noticed that some of the nodes are being treated as different nodes just because they have extra leading spaces.
I tried to remove the spaces using the following code, but I cannot seem to turn the output back into strings.
rhedge = pd.read_csv(r"final.edge.csv")
rhedge
_________________
source | to
niala | Sana, Sana
Wacko | Ana, Aisa
rhedge['to'][1]
'Sana, Sana'
rhedge['splitted_users2'] = rhedge['to'].apply(lambda x:x.split(','))
#I need to split them so they will be included as different nodes
The problem is with the next code
rhedge['splitted_users2'][1]
['Sana', ' Sana']
As you can see the second Sana has a leading space.
I tried to do this:
split_users = []
for i in split:
    row = [x.strip() for x in i]
    split_users.append(row)
pd.Series(split_users)
But when I try to split them by "," again, it won't let me because the data is now a list. I believe that splitting them would make networkx treat them as one node, as opposed to creating a different node for the one with a leading space.
THANK YOU
Changing the lambda expression
import pandas as pd
# dataframe creation
df = pd.DataFrame({'source': ['niala', 'Wacko'], 'to': ['Sana, Sana', 'Ana, Aisa']})
# split and strip with a list comprehension
df['splitted_users2'] = df['to'].apply(lambda x:[y.strip() for y in x.split(',')])
print(df['splitted_users2'][0])
>>> ['Sana', 'Sana']
Alternatively
Option 1
Split on ', ' instead of ','
df['to'] = df['to'].str.split(', ')
Option 2
Replace ' ' with '' and then split on ','
This has the benefit of removing any whitespace around either name (e.g. [' Sana, Sana', ' Ana, Aisa']), but it also removes any spaces inside a name.
df['to'] = df['to'].str.replace(' ', '').str.split(',')
If you want the names split into separate columns, see SO: Pandas split column of lists into multiple columns
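Since the end goal is a set of nodes, the split-and-strip step can also be sketched in plain Python as a function that turns each row into (source, target) pairs. The helper name `to_edges` and the sample rows are hypothetical. With a DataFrame, the rows could presumably be produced with `zip(df['source'], df['to'])`.

```python
def to_edges(rows):
    """Turn (source, 'a, b, c') rows into a flat list of (source, target) pairs."""
    edges = []
    for source, targets in rows:
        for target in targets.split(','):
            target = target.strip()  # remove leading/trailing whitespace only
            if target:               # skip empty pieces such as a trailing comma
                edges.append((source, target))
    return edges

# Made-up sample rows with inconsistent spacing around the commas
rows = [('niala', 'Sana, Sana'), ('Wacko', ' Ana ,Aisa')]
print(to_edges(rows))
```

Unlike replacing every space, `strip()` keeps spaces inside a name, so a name made of two words would stay one node.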

Why is pyperclip not copying result of phone numbers to clipboard

I'm a beginner learning python with Automate The Boring Stuff by Al Sweigart.
I'm currently on the part where he created a program using Regular expression on how to extract emails and phone numbers from documents and have them pasted to another document.
Below is the script:
#! python3
import re
import pyperclip
# Create a regex for phone numbers
phoneRegex = re.compile(r'''
# 08108989212
(\d{11}) # Full phone number
''', re.VERBOSE)
# Create a regex for email addresses
emailRegex = re.compile(r'''
# some.+_thing#(\d{2,5}))?.com
[a-zA-Z0-9_.+] + # name part
@ # @ symbol
[a-zA-Z0-9_.+] + # domain name part
''', re.VERBOSE)
#Get the text off the clipboard
text = pyperclip.paste()
# TODO: Extract the email/phone from this text
extractedPhone = phoneRegex.findall(text)
extractedEmail = emailRegex.findall(text)
allPhoneNumbers = []
for allPhoneNumber in extractedPhone:
    allPhoneNumbers.append(allPhoneNumber[0])
print(extractedPhone)
print(extractedEmail)
# Copy the extracted email/phone to the clipboard
results = '\n'.join(allPhoneNumbers) + '\n' + '\n'.join(extractedEmail)
pyperclip.copy(results)
The script is expected to extract, print both phone numbers and email addresses to the terminal which it does. It is also expected to copy the extracted phone number and email addresses to the clipboard automatically, so they can be pasted to another text editor or word document.
Now the problem is that it copies only the email addresses, and the phone numbers come out as 0 when pasted.
What am I not getting right?
Please pardon the errors in my English.
for library: phonenumbers (pypi, source)
Python version of Google's common library for parsing, formatting,
storing and validating international phone numbers.
I think you will need to use this to format those phone numbers.
To be more specific, you'll need to install the package using:
pip install phonenumbers
The main object that the library deals with is a PhoneNumber object. You can create this from a string representing a phone number using the parse function, but you also need to specify the country that the phone number is being dialled from (unless the number is in E.164 format, which is globally unique).
>>> import phonenumbers
>>> x = phonenumbers.parse("+442083661177", None)
>>> print(x)
Country Code: 44 National Number: 2083661177 Leading Zero: False
>>> type(x)
<class 'phonenumbers.phonenumber.PhoneNumber'>
>>> y = phonenumbers.parse("020 8366 1177", "GB")
>>> print(y)
Country Code: 44 National Number: 2083661177 Leading Zero: False
>>> x == y
True
>>> z = phonenumbers.parse("00 1 650 253 2222", "GB") # as dialled from GB, not a GB number
>>> print(z)
Country Code: 1 National Number: 6502532222 Leading Zero(s): False
More information can be found here: https://pypi.org/project/phonenumbers/
The problem is this part of your code, which you don't need:
allPhoneNumbers = []
for allPhoneNumber in extractedPhone:
    allPhoneNumbers.append(allPhoneNumber[0])
Because the phone regex has a single group, findall() already returns a list of strings, so all this loop does is build a list containing the first character (obviously always 0) of each extracted phone number.
Then change the result as follows:
results = '\n'.join(extractedPhone) + '\n' + '\n'.join(extractedEmail)
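The behaviour can be reproduced in a few lines without the clipboard. This sketch uses the single-group pattern from the question on a made-up sample string (the second number is invented for the illustration).

```python
import re

phone_regex = re.compile(r'(\d{11})')

# Made-up sample text; the first number is the one from the question's comment
text = 'Call 08108989212 or 08031234567'

found = phone_regex.findall(text)
print(found)  # ['08108989212', '08031234567']

# Indexing a string with [0] gives its first character, not the whole match
first_chars = [number[0] for number in found]
print(first_chars)  # ['0', '0']

# Joining the list returned by findall() keeps the full numbers
print('\n'.join(found))
```

If the pattern had two or more groups, `findall()` would return tuples instead, and indexing with `[0]` would then pick the first group rather than the first character.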

String manipulations using Python Pandas

I have some name and ethnicity data, for example:
John Wick English
Black Widow French
I then do a bit of manipulation to make the name as below
John Wick -> john#wick??????????????????????????????????
Black Widow -> black#widow????????????????????????????????
I then create multiple columns in a for loop, each containing a 3-character substring of the name.
I also try to count the letters in each name using re.findall.
I have two questions:
1) Is the for loop efficient? Can I replace it with better code, even though it works as is?
2) I can't get the code that counts the letters to work. Any suggestions?
import pandas as pd
from pandas import DataFrame
import re
# Get csv file into data frame
data = pd.read_csv(r"C:\Users\KubiK\Desktop\OddNames_sampleData.csv")
frame = DataFrame(data)
frame.columns = ["name", "ethnicity"]
name = frame.name
ethnicity = frame.ethnicity
# Remove missing ethnicity data cases
index_missEthnic = frame.ethnicity.isnull()
index_missName = frame.name.isnull()
frame2 = frame.loc[~index_missEthnic, :]
frame3 = frame2.loc[~index_missName, :]
# Make all letters into lowercase
frame3.loc[:, "name"] = frame3["name"].str.lower()
frame3.loc[:, "ethnicity"] = frame3["ethnicity"].str.lower()
# Remove all non-alphabetical characters in Name
frame3.loc[:, "name"] = frame3["name"].str.replace(r'[^a-zA-Z\s\-]', '', regex=True) # Retain space and hyphen
# Replace empty space as "#"
frame3.loc[:, "name"] = frame3["name"].str.replace(r'[\s]', '#', regex=True)
# Find the longest name in the dataset
##frame3["name_length"] = frame3["name"].str.len()
##nameLength = frame3.name_length
##print nameLength.max() # Longest name has !!!40 characters!!! including spaces and hyphens
# Add "?" to fill spaces up to 43 characters
frame3["name_filled"] = frame3["name"].str.pad(side="right", width=43, fillchar="?")
# Split into three-character strings
for i in range(1, 41):
    substr = "substr" + str(i)
    frame3[substr] = frame3["name_filled"].str[i-1:i+2]
# Count number of characters
frame3["name_len"] = len(re.findall('[a-zA-Z]', name))
# Test outputs
print(frame3)
1) Regarding the loop, I can't think of a better way than what you're already doing
2) Try frame3["name_len"] = frame3["name"].map(lambda x : len(re.findall('[a-zA-Z]', x)))
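The original line fails because `re.findall` expects a single string, not a whole column, which is why the count has to be computed per name. A plain-Python sketch of both the letter count and the 3-character windows is below. The helper names `letter_count` and `trigrams` are made up for the illustration.

```python
import re

def letter_count(name):
    """Count ASCII letters in one name."""
    return len(re.findall(r'[a-zA-Z]', name))

def trigrams(padded, count=40):
    """Return the first `count` overlapping 3-character windows of a padded name."""
    return [padded[i:i + 3] for i in range(count)]

name = 'john#wick'
padded = name.ljust(43, '?')  # same padding as str.pad(side="right", width=43, fillchar="?")

print(letter_count(name))    # 8
print(trigrams(padded)[:3])  # ['joh', 'ohn', 'hn#']
```

pandas also provides `Series.str.count`, so something like `frame3["name"].str.count(r'[a-zA-Z]')` may give the same counts without a lambda.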
