How to combine multiple rows of pandas dataframe into one between two other row values python3? - python-3.x

I have a pandas dataframe with a single column that contains name, address, and phone info separated by blank or na rows like this:
data
0 Business name one
1 1234 address ln
2 Town, ST 55655
3 (555) 555-5555
4 nan
5 Business name two
6 5678 address dr
7 New Town, ST 55677
8 nan
9 Business name three
10 nan
and so on...
What I want is this:
Name Addr1 Addr2 Phone
0 Business name one 1234 address ln Town, ST 55655 (555) 555-5555
1 Business name two 5678 address dr New Town, ST 55677
2 Business name three
I am using Python 3 and have been stuck; any help is much appreciated!

You can use:
Create a group id for each row with isnull and cumsum.
Align the group ids with the non-NaN rows using reindex.
Remove the NaN rows with dropna, then set_index to a MultiIndex built with cumcount.
Reshape with unstack.
a = df['data'].isnull().cumsum().reindex(df.dropna().index)
print (a)
0 0
1 0
2 0
3 0
5 1
6 1
7 1
9 2
Name: data, dtype: int32
df = df.dropna().set_index([a, a.groupby(a).cumcount()])['data'].unstack()
df.columns = ['Name','Addr1','Addr2','Phone']
print (df)
Name Addr1 Addr2 Phone
data
0 Business name one 1234 address ln Town, ST 55655 (555) 555-5555
1 Business name two 5678 address dr New Town, ST 55677 None
2 Business name three None None None
If there can be multiple address lines, it is possible to create the columns dynamically:
df.columns = (['Name'] +
              ['Addr{}'.format(x+1) for x in range(len(df.columns) - 2)] +
              ['Phone'])
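For reference, here is a self-contained sketch of the whole approach; the construction of df from the sample data is added here for illustration:
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': [
    'Business name one', '1234 address ln', 'Town, ST 55655', '(555) 555-5555', np.nan,
    'Business name two', '5678 address dr', 'New Town, ST 55677', np.nan,
    'Business name three', np.nan]})

# the group id grows by one at every NaN separator, then is aligned to the non-NaN rows
a = df['data'].isnull().cumsum().reindex(df.dropna().index)

out = df.dropna().set_index([a, a.groupby(a).cumcount()])['data'].unstack()
out.columns = (['Name'] +
               ['Addr{}'.format(x + 1) for x in range(len(out.columns) - 2)] +
               ['Phone'])
print(out)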

df['group']=df['data'].str.contains('Business').cumsum().replace({True:1}).ffill()
df1=df.groupby('group')['data'].apply(list).apply(pd.Series).dropna(axis=1,thresh =1)
df1.columns=['Name','Addr1','Addr2','Phone']
df1
Out[1221]:
Name Addr1 Addr2 \
group
1.0 Business name one 1234 address ln Town, ST 55655
2.0 Business name two 5678 address dr New Town, ST 55677
3.0 Business name three NaN NaN
Phone
group
1.0 (555) 555-5555
2.0 NaN
3.0 NaN

Related

Extract the mapping dictionary between two columns in pandas

I have a dataframe as shown below.
df:
id player country_code country
1 messi arg argentina
2 neymar bra brazil
3 tevez arg argentina
4 aguero arg argentina
5 rivaldo bra brazil
6 owen eng england
7 lampard eng england
8 gerrard eng england
9 ronaldo bra brazil
10 marria arg argentina
From the above df, I would like to extract a mapping dictionary that relates the country_code column to the country column.
Expected Output:
d = {'arg':'argentina', 'bra':'brazil', 'eng':'england'}
A dictionary has unique keys, so it is possible to convert a Series whose (duplicated) index comes from the country_code column:
d = df.set_index('country_code')['country'].to_dict()
If it is possible that some country_code maps to different country values, the last value per code is used.
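A minimal check of that behaviour with the sample data (variable names are illustrative):
import pandas as pd

df = pd.DataFrame({'country_code': ['arg', 'bra', 'arg', 'eng'],
                   'country': ['argentina', 'brazil', 'argentina', 'england']})

# duplicated index labels are allowed in a Series; to_dict keeps the last
# value encountered for each repeated key
d = df.set_index('country_code')['country'].to_dict()
print(d)   # {'arg': 'argentina', 'bra': 'brazil', 'eng': 'england'}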

how to remove rows from a pandas dataframe if two rows contains at least one matching element

I have a pandas dataframe that contains many columns like Name, Email, Mobile Number etc., which looks like this:
Sr No. Name Email Mobile Number
1. John joh***#gmail.com 1234567890,2345678901
2. kylie k.ki**#yahoo.com 6789012345
3. jon null 1234567890
4. kia kia***#gmail.com 6789012345
5. sam b.sam**#gmail.com 4567890123
I want to remove the rows which contain the same mobile number. One person can have more than one number. I tried to do this with the drop_duplicates function:
newdf = df.drop_duplicates(subset = ['Mobile Number'],keep=False)
Here is output :
Sr No. Name Email Mobile Number
1. John joh***#gmail.com 1234567890,2345678901
3. jon null 1234567890
5. sam b.sam**#gmail.com 4567890123
But the problem is that it only removes rows that are exactly the same, whereas I want to remove rows that share at least one number, e.g. Sr No. 1 and 3 have one number in common. How can I remove them so that the final output looks like this:
final output:
Sr No. Name Email Mobile Number
5. sam b.sam**#gmail.com 4567890123
Alright. It is a bit involved, but I was able to solve it.
Here's how I am doing it.
First, I take all the mobile numbers and split them on ,. Then I explode them (explode retains the original index).
Then I find the indexes of all rows with duplicated numbers.
Then I exclude rows from the dataframe if their index was part of a duplicate.
This will give you the unique rows that do not share any number.
I modified your dataframe to include a few more cases.
c = ['Name','Email','Mobile Number']
d = [['John','joh***#gmail.com','1234567890,2345678901,6789012345'],
['kylie','k.ki**#yahoo.com','6789012345'],
['jon','null','1234567890'],
['kia','kia***#gmail.com','6789012345'],
['mac','mac***#gmail.com','2345678901,1098765432'],
['kfc','kfc***#gmail.com','6237778901,1098765432,3034045050'],
['pig','pig***#gmail.com','8007778001,8018765454,5054043030'],
['bil','bil***#gmail.com','1098765432'],
['jun','jun***#gmail.com','9098785434'],
['sam','b.sam**#gmail.com','4567890123']]
import pandas as pd
df = pd.DataFrame(d,columns=c)
print (df)
temp = df.copy()
temp['Mobile Number'] = temp['Mobile Number'].apply(lambda x: x.split(','))
temp = temp.explode('Mobile Number')
#print (temp)
df2 = df[~df.index.isin(temp[temp['Mobile Number'].duplicated(keep=False)].index)]
print (df2)
The output of this is:
Original DataFrame:
Name Email Mobile Number
0 John joh***#gmail.com 1234567890,2345678901,6789012345 # duplicated index: 1, 2, 3, 4
1 kylie k.ki**#yahoo.com 6789012345 # duplicated index: 0, 3
2 jon null 1234567890 # duplicated index: 0
3 kia kia***#gmail.com 6789012345 # duplicated index: 0, 1
4 mac mac***#gmail.com 2345678901,1098765432 # duplicated index: 0, 5, 7
5 kfc kfc***#gmail.com 6237778901,1098765432,3034045050 # duplicated index: 4, 7
6 pig pig***#gmail.com 8007778001,8018765454,5054043030 # no duplicate; should output
7 bil bil***#gmail.com 1098765432 # duplicated index: 4, 5
8 jun jun***#gmail.com 9098785434 # no duplicate; should output
9 sam b.sam**#gmail.com 4567890123 # no duplicate; should output
The output of this will be the 3 rows (index: 6, 8, and 9):
Name Email Mobile Number
6 pig pig***#gmail.com 8007778001,8018765454,5054043030
8 jun jun***#gmail.com 9098785434
9 sam b.sam**#gmail.com 4567890123
Since temp is not needed anymore, you can just delete it using del temp.
One possible solution is to do the following. Say your df is given by
Sr No. Name Email Mobile Number
0 1.0 John joh***#gmail.com 1234567890 , 2345678901
1 2.0 kylie k.ki**#yahoo.com 6789012345
2 3.0 jon NaN 1234567890
3 4.0 kia kia***#gmail.com 6789012345
4 5.0 sam b.sam**#gmail.com 4567890123
You can split your Mobile Number column into two (or more) columns mob1, mob2, ... and then drop duplicates:
df[['mob1', 'mob2']]= df["Mobile Number"].str.split(" , ", n = 1, expand = True)
newdf = df.drop_duplicates(subset = ['mob1'],keep=False)
which returns
Sr No. Name Email Mobile Number mob1 mob2
4 5.0 sam b.sam**#gmail.com 4567890123 4567890123 None
EDIT
To handle the possible swapped order of numbers, one can extend the method by dropping duplicates from all created columns:
df[['mob1', 'mob2']] = df["Mobile Number"].str.split(" , ", n=1, expand=True)
newdf = df.drop_duplicates(subset=['mob1'], keep=False)
newdf = newdf.drop_duplicates(subset=['mob2'], keep=False)
which returns:
Sr No. Name Email Mobile Number mob1 \
0 1.0 John joh***#gmail.com 2345678901 , 1234567890 2345678901
mob2
0 1234567890
If there are individuals with more than two numbers, then as many columns as the maximum number of phone numbers need to be created, as sketched below.
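A hedged sketch of that extension (the column names mob1, mob2, ... are illustrative, and df is assumed to be the original frame without the mob columns): split into as many mobN columns as the longest entry needs, then drop per-column duplicates while ignoring the None padding:
mobs = df["Mobile Number"].str.split(" , ", expand=True)
mobs.columns = ['mob{}'.format(i + 1) for i in range(mobs.shape[1])]

newdf = df.join(mobs)
for col in mobs.columns:
    # rows sharing a non-null value in this column are removed;
    # the None padding on rows with fewer numbers is not treated as a duplicate
    dup = newdf[col].notna() & newdf[col].duplicated(keep=False)
    newdf = newdf[~dup]
Note that this still only compares numbers position by position within each column; the explode-based answer above compares every number against every other.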

Pandas Drop an Entire Column if All of the Values equal a Certain Value

Let's say I have dataframes that looks like this:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
The logic I am trying to implement is something like this:
If all of column C = "NaN" then drop the entire column
Else if all of column C = "Member" drop the entire column
else do nothing
Any suggestions?
Edit: Added Expected Output
Expected Output if using on both data frames:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b
0 dave blue
1 bill red
2 sally green
Edit #2: Why am I doing this in the first place?
I am ripping text from PDF files and placing it into CSV files using the Tabula library.
The data is not coming out in the way that I am hoping it would, so I am applying ETL concepts to move the data around.
The final outcome would be for management to be able to open the final result into a nicely formatted Excel file.
Some of the columns have part of the headers put into a separate row and things got shifted around for some reason while ripping the data out of the PDF.
The headers look something like this:
Team Type Member Contact
Count
What I am doing is checking an entire column for certain header values. If the entire column has a header value, I'm dropping the entire column.
The idea is to replace Member with missing values first, then test whether at least one value is non-missing using notna with any, and finally extend the mask to all columns (filled with True) via Series.reindex:
mask = (df[['c']].replace('Member', np.nan)
                 .notna()
                 .any()
                 .reindex(df.columns, fill_value=True))
print (mask)
Another idea is to chain both masks with & for bitwise AND:
mask = ((df[['c']].notna() & df[['c']].ne('Member'))
        .any()
        .reindex(df.columns, fill_value=True))
print (mask)
Last, filter the columns with DataFrame.loc:
df = df.loc[:, mask]
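A quick end-to-end check with df_two from the question (the frame construction is added here for illustration); every value in column c is NaN or Member, so the mask comes back False for c and the column is dropped:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['dave', 'bill', 'sally'],
                   'b': ['blue', 'red', 'green'],
                   'c': [np.nan, np.nan, 'Member']})

mask = (df[['c']].replace('Member', np.nan)
                 .notna()
                 .any()
                 .reindex(df.columns, fill_value=True))
print(df.loc[:, mask])
#        a      b
# 0   dave   blue
# 1   bill    red
# 2  sally  green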
Here's an alternate approach to do this.
import pandas as pd
import numpy as np
c = ['a','b','c']
d = [
['dave', 'blue', np.NaN],
['bill', 'red', np.NaN],
['sally', 'green', 'Member'],
['Ian', 'Org', 'Paid']]
df1 = pd.DataFrame(d,columns = c)
df2 = df1.loc[df1['a'] != 'Ian']
print (df1)
print (df2)
if df1.c.replace('Member',np.NaN).isnull().all():
    df1 = df1[df1.columns.drop(['c'])]
print (df1)
if df2.c.replace('Member',np.NaN).isnull().all():
    df2 = df2[df2.columns.drop(['c'])]
print (df2)
Output of this is:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b
0 dave blue
1 bill red
2 sally green
My idea is simple; maybe it will help you. I want to make sure this is what you want: drop the whole column if it contains only NaN or 'Member', else do nothing.
So the column needs to be checked first (does it contain only NaN or 'Member'?). We change 'Member' to NaN and test for all-null (or something similar).
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':['dave','bill','sally','ian'],'B':['blue','red','green','org'],'C':[np.nan,np.nan,'Member','Paid']})
df2 = df.drop(index=[3],axis=0)
print(df)
print(df2)
# df 1
col = pd.Series([np.nan if x=='Member' else x for x in df['C'].tolist()])
if col.isnull().all():
    df = df.drop(columns='C')
# df2
col = pd.Series([np.nan if x=='Member' else x for x in df2['C'].tolist()])
if col.isnull().all():
    df2 = df2.drop(columns='C')
print(df)
print(df2)
A B C
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 ian org Paid
A B
0 dave blue
1 bill red
2 sally green

How to merge data with duplicates using panda python

I have the two dataframes below and I'd like to merge them to get the Id onto df1. However, when I use merge, I cannot get the Id when a name appears more than once. df2 has unique names; df1 and df2 differ in rows and columns. My code is below:
df1: Name Region
0 P Asia
1 Q Eur
2 R Africa
3 S NA
4 R Africa
5 R Africa
6 S NA
df2: Name Id
0 P 1234
1 Q 1244
2 R 1233
3 S 1111
code:
x = df1.assign(temp1 = df1.groupby('Name').cumcount())
y = df2.assign(temp1 = df2.groupby('Name').cumcount())
xy = x.merge(y, on=['Name','temp1'], how='left').drop(columns=['temp1'])
the output is:
df1: Name Region Id
0 P Asia 1234
1 Q Eur 1244
2 R Africa 1233
3 S NA 1111
4 R Africa NAN
5 R Africa NAN
6 S NA NAN
How do I find all the Ids for these duplicate names?
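One possible sketch, assuming the goal is simply to attach each Name's Id to every matching row of df1: because df2 holds one row per Name, a plain left merge on Name fills the Id for the duplicated names, without the cumcount helper columns:
import pandas as pd

df1 = pd.DataFrame({'Name': ['P', 'Q', 'R', 'S', 'R', 'R', 'S'],
                    'Region': ['Asia', 'Eur', 'Africa', 'NA', 'Africa', 'Africa', 'NA']})
df2 = pd.DataFrame({'Name': ['P', 'Q', 'R', 'S'],
                    'Id': [1234, 1244, 1233, 1111]})

# every row of df1 gets the single Id registered for its Name in df2
xy = df1.merge(df2, on='Name', how='left')
print(xy)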

How to split a Dataframe column whose data is not unique

I have a column called Users in a dataframe which doesn't have a uniform format. I am doing a data cleanup project, as the data looks unreadable.
company Users
A [{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]
B [{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales" }]
I used the query below, which broke the dataframe down as shown:
df2 = df
df2 = df2.join(df['Users_config'].str.split('},{', expand=True).add_prefix('Users'))
company Users0 Users1
A "Name":"Martin","Email":"name_1#email.com","EmpType":"Full" "Name":"Rick","Email":"name_2#email.com","Dept":"HR"
B "Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"
and further breaking the above df on "," with the same approach, I got this output:
Company Users01 Users02 Users03 Users10 Users11 Users12
1 "Name":"Martin" "Email":"name_1#email.com" "EmpType":"Full" "Name":"Rick" "Email":"name_2#email.com" "Dept":"HR"
2 "Name":"John" "Email":"name_2#email.com" "EmpType":"Full" "Dept":"Sales"
As this dataframe looks messy, I want to get the output below. I feel the best way to name the columns is to use the key from the value itself (e.g. "Name" from "Name":"Martin"); if we hardcode names with df.rename, the column names will get mismatched.
Company Name_1 Email_1 EmpType_1 Dept_1 Name_2 Email_2 Dept_2
1 Martin name_1#email.com Full Rick name_2#email.com "HR"
2 John name_2#email.com" Full Sales
Is there any way I can get the above output from the original dataframe?
Use:
import ast
import pandas as pd

df['Users'] = df['Users'].apply(ast.literal_eval)
d = df.explode('Users').reset_index(drop=True)
d = d.join(pd.DataFrame(d.pop('Users').tolist()))
d = d.set_index(['company', d.groupby('company').cumcount().add(1).astype(str)]).unstack()
d.columns = d.columns.map('_'.join)
Details:
First we use ast.literal_eval to evaluate the strings in Users column, then use DataFrame.explode on column Users to create a dataframe d.
print(d)
company Users
0 A {'Name': 'Martin', 'Email': 'name_1#email.com', 'EmpType': 'Full'}
1 A {'Name': 'Rick', 'Email': 'name_2#email.com', 'Dept': 'HR'}
2 B {'Name': 'John', 'Email': 'name_2#email.com', 'EmpType': 'Full', 'Dept': 'Sales'}
Create a new dataframe from the Users column in d and use DataFrame.join to join this new dataframe with d.
print(d)
company Name Email EmpType Dept
0 A Martin name_1#email.com Full NaN
1 A Rick name_2#email.com NaN HR
2 B John name_2#email.com Full Sales
Use DataFrame.groupby on column company then use groupby.cumcount to create a counter for each group, then use DataFrame.set_index to set the index of d as company + counter. Then use DataFrame.unstack to reshape the dataframe creating MultiIndex columns.
print(d)
Name Email EmpType Dept
1 2 1 2 1 2 1 2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
Finally use map along with .join to flatten the MultiIndex columns.
print(d)
Name_1 Name_2 Email_1 Email_2 EmpType_1 EmpType_2 Dept_1 Dept_2
company
A Martin Rick name_1#email.com name_2#email.com Full NaN NaN HR
B John NaN name_2#email.com NaN Full NaN Sales NaN
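For reference, a self-contained sketch that builds the sample frame and runs the steps above end to end (the construction of df is added for illustration; the Users column is assumed to hold the JSON-like strings shown in the question):
import ast
import pandas as pd

df = pd.DataFrame({
    'company': ['A', 'B'],
    'Users': [
        '[{"Name":"Martin","Email":"name_1#email.com","EmpType":"Full"},'
        '{"Name":"Rick","Email":"name_2#email.com","Dept":"HR"}]',
        '[{"Name":"John","Email":"name_2#email.com","EmpType":"Full","Dept":"Sales"}]',
    ],
})

df['Users'] = df['Users'].apply(ast.literal_eval)   # strings -> lists of dicts
d = df.explode('Users').reset_index(drop=True)      # one row per user dict
d = d.join(pd.DataFrame(d.pop('Users').tolist()))   # dict keys become columns
d = d.set_index(['company', d.groupby('company').cumcount().add(1).astype(str)]).unstack()
d.columns = d.columns.map('_'.join)                 # flatten the MultiIndex columns
print(d)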
