Python: Using Pandas and Regex to Clean Phone Numbers with Country Code - python-3.x

I'm attempting to use pandas to clean phone numbers so that it returns only the 10 digit phone number and removes the country code if it is present and any special characters.
Here's some sample code:
phone_series = pandas.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
r1 = '[^0-9]+'
phone_series.str.replace(r1, '')
Returns
0 11921674056
1 1233454568
2 1233455678
3 1231231234
As you can see, this regex works well except for the country code. And unfortunately, this system I'm loading into cannot accept the country code. What I'm struggling with, is finding a regex that with strip the country code as well. All the regex's I've found will match the 10 digits I need, and in this case with using pandas, I need to not match them.
I could easily write a function and use .apply but I feel like there is likely a simple regex solution that I'm missing.
Thanks for any help!

I don't think regex is necessary here, which is nice because regex is a pain in the buns.
To append your current solution:
phone_series = pandas.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
r1 = '[^0-9]+'
phone_series.str.replace(r1, '')
phone_series = phone_series.apply(lambda x: x[-10:])
My lazier solution:
>>> phone_series = pd.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
>>> p2 = phone_series.apply(lambda x: ''.join([i for i in x if str.isnumeric(i)])[-10:])
>>> p2
0 1921674056
1 1233454568
2 1233455678
3 1231231234
dtype: object

Related

How to use regex in Pandas?

I have extracted the column names from a .csv file and now I want to use a regex expression in order to capitalise the first letter of the word and the first letter after the _ character.
Example: loan_status -> Loan_Status
Loan_ID
loan_status
Principal
terms
effective_date
due_date
paid_off_time
past_due_days
age
education
Gender
This is what I have come up with so far (^[a-z])+\w+
UPDATE
Thanks to Wiktor Stribiżew, this is what I came up with.
I am wondering if there is a more compact way to do the below.
import csv
import pandas as pd
import re
dataFrame = pd.read_csv('Loan_payments_data_2020_unclean.csv')
columnsDict = {"columnName": list(dataFrame.columns)}
columnsDataFrame = pd.DataFrame(columnsDict)
replacedColumns = columnsDataFrame['columnName'].str.replace(r'(?<![^_]).', lambda x: x.group().upper())
dataFrame.columns = list(replacedColumns)
print(dataFrame)
You may use
>>> df = pd.DataFrame({'Loan_ID': ['loan_status','Principal','terms','effective_date','due_date','paid_off_time','past_due_days','age','education','Gender']})
>>> df['Loan_ID'].str.replace(r'(?<![^_]).', lambda x: x.group().upper())
0 Loan_Status
1 Principal
2 Terms
3 Effective_Date
4 Due_Date
5 Paid_Off_Time
6 Past_Due_Days
7 Age
8 Education
9 Gender
Name: Loan_ID, dtype: object
The (?<![^_]). regex matches any char other than line break char that is either at the start of string or appears immediately after a _ char. It is equal to (?:(?<=^)|(?<=_)). regex, see its demo online.
Since you cannot manipulate the matched value from within a string replacement pattern, a callable is required as the replacement argument. lambda x: x.group().upper() just grabs the match value and turns it to upper case.

How to split compound word in pandas?

I have and document that consist of many compounds (or sometimes combined) word as:
document.csv
index text
0 my first java code was helloworld
1 my cardoor is totally broken
2 I will buy a screwdriver to fix my bike
As seen above some words are combined or compound and I am using compound word splitter from here to fix this issue, however, I have trouble to apply it in each row of my document (like pandas series) and convert the document into a clean form of:
cleanDocument.csv
index text
0 my first java code was hello world
1 my car door is totally broken
2 I will buy a screw driver to fix my bike
(I am aware of word such as screwdriver should be together, but my goal is cleaning the document). If you have a better idea for splitting only combined words, please let me know.
splitter code may works as:
import pandas as pd
import splitter ## This use enchant dict (pip install enchant requires)
data = pd.read_csv('document.csv.csv')
then, it should use:
splitter.split(data) ## ???
I already looked into something like this but this not work in my case. thanks
You use apply wit axis =1 : Can you try the following
data.apply(lambda x: splitter.split(j) for j in (x.split()), axis = 1)
I do not have splitter installed on my system. By looking at the link you have provided, I have this following code. Can you try:
def handle_list(m):
ret_lst = []
L = m['text'].split()
for wrd in L:
g = splitter.split(wrd)
if g :
ret_lst.extend(g)
else:
ret_lst.append(wrd)
return ret_lst
dft.apply(handle_list, axis = 1)

List.Sort() method places single digit last however non single digits are placed first python [duplicate]

I have some files that need to be sorted by name, unfortunately I can't use a regular sort, because I also want to sort the numbers in the string, so I did some research and found that what I am looking for is called natural sorting.
I tried the solution given here and it worked perfectly.
However, for strings like PresserInc-1_10.jpg and PresserInc-1_11.jpg which causes that specific natural key algorithm to fail, because it only matches the first integer which in this case would be 1 and 1, and so it throws off the sorting. So what I think might help is to match all numbers in the string and group them together, so if I have PresserInc-1_11.jpg the algorithm should give me 111 back, so my question is, is this possible ?
Here's a list of filenames:
files = ['PresserInc-1.jpg', 'PresserInc-1_10.jpg', 'PresserInc-1_11.jpg', 'PresserInc-10.jpg', 'PresserInc-2.jpg', 'PresserInc-3.jpg', 'PresserInc-4.jpg', 'PresserInc-5.jpg', 'PresserInc-6.jpg', 'PresserInc-11.jpg']
Google: Python natural sorting.
Result 1: The page you linked to.
But don't stop there!
Result 2: Jeff Atwood's blog that explains how to do it properly.
Result 3: An answer I posted based on Jeff Atwood's blog.
Here's the code from that answer:
import re
def natural_sort(l):
convert = lambda text: int(text) if text.isdigit() else text.lower()
alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
return sorted(l, key=alphanum_key)
Results for your data:
PresserInc-1.jpg
PresserInc-1_10.jpg
PresserInc-1_11.jpg
PresserInc-2.jpg
PresserInc-3.jpg
etc...
See it working online: ideone
If you don't mind third party libraries, you can use natsort to achieve this.
>>> import natsort
>>> files = ['PresserInc-1.jpg', 'PresserInc-1_10.jpg', 'PresserInc-1_11.jpg', 'PresserInc-10.jpg', 'PresserInc-2.jpg', 'PresserInc-3.jpg', 'PresserInc-4.jpg', 'PresserInc-5.jpg', 'PresserInc-6.jpg', 'PresserInc-11.jpg']
>>> natsort.natsorted(files)
['PresserInc-1.jpg',
'PresserInc-1_10.jpg',
'PresserInc-1_11.jpg',
'PresserInc-2.jpg',
'PresserInc-3.jpg',
'PresserInc-4.jpg',
'PresserInc-5.jpg',
'PresserInc-6.jpg',
'PresserInc-10.jpg',
'PresserInc-11.jpg']

Python: Print entire line of string match and not cut off after the period

See bottom for the solution I came up with.
Hopefully this is a easy question for you guys. Trying to match a string to a list and print just that string matched. I was successful using re, but it is cutting off the rest of the string after the period. The span per re is 0,10 and when i look at the output without using re it is 0,14 not 0,10 so match is cutting off the info after the period. So I would like to learn how to tell it to print the entire span or learn a new way to match a var string to a list and print that exact string. My original attempts printed anything with the TESTPR in it, 3 printed total, the others I do not want printing have a 1 in the front and the last match has an additional R at the end. Here is my current match code:
#OLD See below
for element in catalog:
z = re.match("((TESTPRR )\w+)", element)
if z:
print((z.group()))
Output: TESTPR 105
It should show:
Wanted output: TESTPT 105.465
It will go up to 3 decimal places after the period and no more. I am currently taking a Python class to learn Python and love it so far, but this one has me stumped as I am just now learning about re and matching by reading as we have not gotten to that yet in class.
I am open to learning a different way to search for and match a string and print just that string. For my first attempt that prints 3 results was this:
catalog = [ long list pulled from API then code here to make it a nice column]
prod = 'TESTPR'
print ([s for s in catalog if prod in s])
When I add a space at the end of prod i can get rid of the match with the extra char at the end, but I cannot add a space to do the same thing with the match that has an extra char at the front. This is for the code above and not for the re match code. Thanks!
Answer below!
Since you are interested in learning about ways to match strings and solve your problem: try fuzzywuzzy.
In your case you could try:
from fuzzywuzzy import process
catalog = [long list pulled from API then code here to make it a nice column]
prod = "TESTPR"
hit = process.extractOne(prod, catalog, score_cutoff = 75) #you can adjust this to suit how close the match should be
print(hit[0]) #hit will be sth like ("TESTPT 105.465", 75)
Output: TESTPT 105.465
For information on different ways of using fuzzywuzzy, check out this link.
You can use different ways of matching such as:
fuzz.partial_ratio
fuzz.ratio
token_sort_ratio
fuzz.token_set_ratio
for this from fuzzywuzzy import fuzz
Kept at it with re.match and got the correct regex so the entire match prints and it does not cut off numbers after the period.
my original match as you can see above was re.match("((TESTPRR )\w+)", element), some of the ( were unneeded and needed to add a few more expressions and now it prints the correct match. See above for old code and below for the new code that works.
# New code, replaced w+ with w*\d*[.,]?\d*$
for element in catalog:
z = re.match("STRING\w*\d*[.,]?\d*$", element)
if z:
print(z.group())

Pandas select rows where a value in a columns does not starts with a string

I have a data where I need to filter out any rows that do start with a certain values - emphasis on plural:
Below the data exactly as it appears in file data.xlsx
Name Remains
GESDSRPPZ0161 TRUE
RT6000996 TRUE
RT6000994 TRUE
RT6000467 TRUE
RT6000431 TRUE
MCOPSR0034 FALSE
MCOPSR0033 FALSE
I need to be able to return a dataframe where name DOES NOT start with MCO, GE,etc.
import pandas as pd
import numpy as np
### data
file = r'C:\Users\user\Desktop\data.xlsx'
data = pd.read_excel(file, na_values = '')
data['name'] = data['name'].str.upper()
prefixes = ['IM%','JE%','GE%','GV%','CHE%','MCO%']
new_data = data.select(lambda x: x not in prefixes)
new_data.shape
the last call returns exactly the same dataset as I started with.
I tried:
pandas select from Dataframe using startswith
but it excludes data if the string is elsewhere (not only starts with)
df = df[df['Column Name'].isin(['Value']) == False]
The above answer would work if I knew exactly the string in question, however it changes (the common part is MCOxxxxx, GVxxxxxx, GExxxxx...)
The vvery same happens with this one:
How to implement 'in' and 'not in' for Pandas dataframe
because the values I have to pass have to be exact. Is there any way to do with using the same logic as here (Is there any equivalent for wildcard characters like SQL?):
How do I select rows where a column value starts with a certain string?
Thanks for help! Can we expand please on the below?
#jezrael although I've chosen the other solution for simplicity (and my lack of understanding of your solution), but I'd like to ask for a bit of explanation please. What does '^' + '|^' do in this code and how is it different from Wen's solution? How does it compare performance wise when you have for loop construct as oppose to operation on series like map or apply? If I understand correctly contains() is not bothered with the location whereby startswith() specifically looks at the beggining of the string. Does it mean the
^indicates to contains() to do what? Start at the beginning?
And | is it another special char for the method or is it treated like logical OR? I'd really want to learn this if you don't mind sharing. Thanks
You can using startswith , the ~ in the front will convert from in to not in
prefixes = ['IM','JE','GE','GV','CHE','MCO']
df[~df.Name.str.startswith(tuple(prefixes))]
Out[424]:
Name Remains
1 RT6000996 True
2 RT6000994 True
3 RT6000467 True
4 RT6000431 True
Use str.contains with ^ for start of string and filter by boolean indexing:
prefixes = ['IM','JE','GE','GV','CHE','MCO']
pat = '|'.join([r'^{}'.format(x) for x in prefixes])
df = df[~df['Name'].str.contains(pat)]
print (df)
Name Remains
1 RT6000996 True
2 RT6000994 True
3 RT6000467 True
4 RT6000431 True
Thanks, #Zero for another solution:
df = df[~df['Name'].str.contains('^' + '|^'.join(prefixes))]
print (df)
Name Remains
1 RT6000996 True
2 RT6000994 True
3 RT6000467 True
4 RT6000431 True

Resources