pandas: groupby + store in another dataframe - python-3.x

I asked a similar question last week and now I have a similar issue, but I cannot convert the answer I received in this case.
Basically, I have a dataframe called comms which looks like this:
articleID Material commentScore
1234 News 0.75
1234 News -0.1
5678 Sport 1.33
5678 News 0.75
5678 Fashion 0.02
7412 Politics -3.45
and another dataframe called arts and it looks like this:
articleID wordCount byLine
1234 1524 John
5678 9824 Mary
7412 3713 Sam
I would like to simply count how many comms there are for each articleID, and store this number in a new column of the arts dataframe named commentNumber.
I think I have to use groupby, count() and maybe merge, but I can't figure out how.
Expected output
articleID wordCount byLine commentNumber
1234 1524 John 2
5678 9824 Mary 3
7412 3713 Sam 1
Thanks in advance!
Andrea

Use groupby() followed by count() on one column, then map the result onto the articleID column of arts.
arts['commentNumber'] = arts['articleID'].map(comms.groupby('articleID')['Material'].count())
print(arts)
articleID wordCount byLine commentNumber
0 1234 1524 John 2
1 5678 9824 Mary 3
2 7412 3713 Sam 1

Use Series.map with Series.value_counts:
arts['commentNumber'] = arts['articleID'].map(comms['articleID'].value_counts())
print (arts)
articleID wordCount byLine commentNumber
0 1234 1524 John 2
1 5678 9824 Mary 3
2 7412 3713 Sam 1
Alternative:
from collections import Counter
arts['commentNumber'] = arts['articleID'].map(Counter(comms['articleID']))
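Since the question mentions merge, here is a minimal sketch of that route, rebuilt from the sample data in the question. The name `counts` is introduced only for illustration, and filling missing counts with 0 is an assumption about how articles without comments should be handled.

```python
import pandas as pd

comms = pd.DataFrame({
    'articleID': [1234, 1234, 5678, 5678, 5678, 7412],
    'Material': ['News', 'News', 'Sport', 'News', 'Fashion', 'Politics'],
    'commentScore': [0.75, -0.1, 1.33, 0.75, 0.02, -3.45],
})
arts = pd.DataFrame({
    'articleID': [1234, 5678, 7412],
    'wordCount': [1524, 9824, 3713],
    'byLine': ['John', 'Mary', 'Sam'],
})

# Count rows per articleID; size() counts rows even if other columns hold NaN.
counts = comms.groupby('articleID').size().reset_index(name='commentNumber')

# A left merge keeps every article, including ones that have no comments.
arts = arts.merge(counts, on='articleID', how='left')

# Assumption: an article with no comments should show 0 rather than NaN.
arts['commentNumber'] = arts['commentNumber'].fillna(0).astype(int)
print(arts)
```

Compared with map, the merge route is a little more verbose, but it extends naturally if several aggregated columns need to be attached at once.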

Related

How update a dataframe column value from second dataframe where values on two specific columns that can repeat on first match on both dataframes?

I have two dataframes with different information about a person. In the first dataframe, a person's name may repeat in different rows. I want to add/update the first dataframe with data from the second dataframe where the two columns containing the person's data match in both. Here is an example of what I need to accomplish:
df1:
name surname
0 john doe
1 mary doe
2 peter someone
3 mary doe
4 john another
5 paul another
df2:
name surname account_id
0 peter someone 100
1 john doe 200
2 mary doe 300
3 john another 400
I need to accomplish this:
df1:
name surname account_id
0 john doe 200
1 mary doe 300
2 peter someone 100
3 mary doe 300
4 john another 400
5 paul another <empty>
Thanks!
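One possible approach, sketched here with the sample data from the question, is a left merge on both key columns. It assumes each name/surname pair appears at most once in df2; the name `result` is introduced only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['john', 'mary', 'peter', 'mary', 'john', 'paul'],
    'surname': ['doe', 'doe', 'someone', 'doe', 'another', 'another'],
})
df2 = pd.DataFrame({
    'name': ['peter', 'john', 'mary', 'john'],
    'surname': ['someone', 'doe', 'doe', 'another'],
    'account_id': [100, 200, 300, 400],
})

# how='left' keeps every row of df1 in its original order;
# rows with no match in df2 (paul another) get NaN in account_id.
result = df1.merge(df2, on=['name', 'surname'], how='left')
print(result)
```

Because of the NaN for the unmatched row, account_id comes back as a float column. If df2 contained duplicate name/surname pairs, the merge would duplicate the matching df1 rows, so the uniqueness assumption matters.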

Using Regex to change the name values format in a dataframe

I'm pretty sure I'm asking the wrong question here, so here goes. I have two dataframes, let's call them df1 and df2.
df1 looks like this:
data = {'Employee ID' : [12345, 23456, 34567],
'Values' : [123168546543154, 13513545435145434, 556423145613],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Wendy Davis', 'Albus Dumbledore', 'James Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID','Values','Employee Name','Department Supervisor'])
df2 looks similar:
data = {'Employee ID' : [12345, 23456, 34567],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Davis, Wendy', 'Dumbledore, Albus', 'Halliday, James']}
df2 = pd.DataFrame(data, columns=['Employee ID','Employee Name','Department Supervisor'])
My issue is that df1 is from an excel file and that sometimes has an Employee ID entered and sometimes doesn't. This is where df2 comes in, df2 is a sql pull from the employee database that I'm using to validate the employee names and supervisor names to ensure the correct employee id is used.
Normally I'd be happy to merge the dataframes to get my desired result, but with the supervisor names being in different formats I'd like to use regex on df1 to turn 'Wendy Davis' into 'Davis, Wendy', along with the other supervisor names, to match what df2 has. So far I'm coming up empty on how to search for an answer to this. Any suggestions?
IIUC, do you need?
df1['DS Corrected'] = df1['Department Supervisor'].str.replace(r'(\w+) (\w+)', r'\2, \1', regex=True)
Output:
Employee ID Values Employee Name Department Supervisor DS Corrected
0 12345 123168546543154 Jones, John Wendy Davis Davis, Wendy
1 23456 13513545435145434 Potter, Harry Albus Dumbledore Dumbledore, Albus
2 34567 556423145613 Watts, Wade James Halliday Halliday, James
Since Albus' full name is Albus Percival Wulfric Brian Dumbledore and James' is James Donovan Halliday (if we're talking about Ready Player One) then consider a dataframe of:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Wendy Davis
1 23456 13513545435145434 Potter, Harry Albus Percival Wulfric Brian Dumbledore
2 34567 556423145613 Watts, Wade James Donovan Halliday
So we need to swap the last name to the front with...
import pandas as pd
data = {'Employee ID' : [12345, 23456, 34567],
'Values' : [123168546543154, 13513545435145434, 556423145613],
'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
'Department Supervisor' : ['Wendy Davis', 'Albus Percival Wulfric Brian Dumbledore', 'James Donovan Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID','Values','Employee Name','Department Supervisor'])
def swap_names(text):
    # Split into first name, any middle names, and last name
    first, *middle, last = text.split()
    if len(middle) == 0:
        return last + ', ' + first
    else:
        return last + ', ' + first + ' ' + ' '.join(middle)
df1['Department Supervisor'] = [swap_names(row) for row in df1['Department Supervisor']]
print(df1)
Outputs:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Davis, Wendy
1 23456 13513545435145434 Potter, Harry Dumbledore, Albus Percival Wulfric Brian
2 34567 556423145613 Watts, Wade Halliday, James Donovan
Maybe...
df1['Department Supervisor'] = [', '.join(x.split()[::-1]) for x in df1['Department Supervisor']]
Outputs:
Employee ID Values Employee Name Department Supervisor
0 12345 123168546543154 Jones, John Davis, Wendy
1 23456 13513545435145434 Potter, Harry Dumbledore, Albus
2 34567 556423145613 Watts, Wade Halliday, James
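For names with middle names, a vectorized variant of the swap can be sketched with a single regex that moves only the final word to the front. The Series name `supervisors` and the pattern are illustrative choices, not taken from the answers above.

```python
import pandas as pd

supervisors = pd.Series([
    'Wendy Davis',
    'Albus Percival Wulfric Brian Dumbledore',
    'James Donovan Halliday',
])

# Group 1 greedily takes everything before the final word; group 2 is the last word.
# A single-word name has no whitespace, so it does not match and is left unchanged.
swapped = supervisors.str.replace(r'^(.*)\s+(\S+)$', r'\2, \1', regex=True)
print(swapped.tolist())
```

This treats the last whitespace-separated word as the surname, which is an assumption that will not hold for multi-word surnames.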

Text data massaging to conduct distance calculations in python

I am trying to turn the text data from dataframe "A" into rows and the text data from dataframe "B" into columns of a new dataframe "C" in order to compute distances between them.
Data in dataframe "A" looks like this
Unique -> header
'Amy'
'little'
'sheep'
'dead'
Data in dataframe "B" looks like this
common_words -> header
'Amy'
'George'
'Barbara'
I want the output in dataframe C to look like this:
Amy George Barbara
Amy
little
sheep
dead
Can anyone help me with this?
What should be the actual content of data frame C? Do you only want to initialise it to some value (i.e. 0) in the first step and then fill it with the distance calculations?
You could initialise C in the following way:
import pandas as pd
A = pd.DataFrame(['Amy', 'little', 'sheep', 'dead'])
B = pd.DataFrame(['Amy', 'George', 'Barbara'])
C = pd.DataFrame([[0] * len(B)] * len(A), index=A[0], columns=B[0])
C will then look like:
Amy George Barbara
0
Amy 0 0 0
little 0 0 0
sheep 0 0 0
dead 0 0 0
Please use pd.DataFrame(index=[list], columns=[list]).
Extract the relevant lists using list(df.columnname.values).
Dummy data
print(dfA)
Header
0 Amy
1 little
2 sheep
3 dead
print(dfB)
Header
0 Amy
1 George
2 Barbara
dfC=pd.DataFrame(index=list(dfA.Header.values), columns=list(dfB.Header.values))
Amy George Barbara
Amy NaN NaN NaN
little NaN NaN NaN
sheep NaN NaN NaN
dead NaN NaN NaN
If you want dfC without NaNs, use:
dfC=pd.DataFrame(index=list(dfA.Header.values), columns=list(dfB.Header.values)).fillna(' ')
Amy George Barbara
Amy
little
sheep
dead
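Once the empty grid exists, it still has to be filled with distances. The question does not say which metric is intended, so the sketch below uses a hypothetical `distance` helper built on the standard library's difflib.SequenceMatcher (1 minus its similarity ratio); any other string distance could be swapped in.

```python
from difflib import SequenceMatcher

import pandas as pd

dfA = pd.DataFrame({'Header': ['Amy', 'little', 'sheep', 'dead']})
dfB = pd.DataFrame({'Header': ['Amy', 'George', 'Barbara']})


def distance(a, b):
    # Hypothetical metric: 1 minus difflib's similarity ratio,
    # so identical strings give 0.0 and values stay within [0, 1].
    return 1.0 - SequenceMatcher(None, a, b).ratio()


# One row per word in dfA, one column per word in dfB.
dfC = pd.DataFrame(
    [[distance(a, b) for b in dfB['Header']] for a in dfA['Header']],
    index=dfA['Header'].tolist(),
    columns=dfB['Header'].tolist(),
)
print(dfC)
```

Building the values in a nested list comprehension and passing them to the constructor avoids filling the frame cell by cell.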

Separate a name into first and last name using Pandas

I have a DataFrame that looks like this:
name birth
John Henry Smith 1980
Hannah Gonzalez 1900
Michael Thomas Ford 1950
Michelle Lee 1984
And I want to create two new columns, "middle" and "last" for the middle and last names of each person, respectively. People who have no middle name should have None in that data frame.
This would be my ideal result:
name middle last birth
John Henry Smith 1980
Hannah None Gonzalez 1900
Michael Thomas Ford 1950
Michelle None Lee 1984
I have tried different approaches, such as this:
df['middle'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 2 else None)
df['last'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 1 else x.split(" ")[2])
I even made some functions that try to do the same thing more carefully, but I always get the same error: "List Index out of range". This is weird because if I go about printing df.iloc[i,0].split(" ") for i in range(len(df)), I do get lists with length 2 or length 3 only.
I also printed x.count(" ") for all x in the "name" column and I always got either 1 or 2 as a result. There are no single names.
This is my first question so thank you so much!
Use Series.str.split with expand=True.
df2 = (df['name'].str
.split(' ',expand = True)
.rename(columns = {0:'name',1:'middle',2:'last'}))
new_df = df2.assign(middle = df2['middle'].where(df2['last'].notnull()),
last = df2['last'].fillna(df2['middle']),
birth = df['birth'])
print(new_df)
name middle last birth
0 John Henry Smith 1980
1 Hannah NaN Gonzalez 1900
2 Michael Thomas Ford 1950
3 Michelle NaN Lee 1984
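A variant that does not depend on the number of name parts can be sketched by keeping the split result as lists and indexing from both ends. The names `parts` and `middle` are introduced here for illustration, and treating everything between the first and last word as the middle name is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John Henry Smith', 'Hannah Gonzalez', 'Michael Thomas Ford', 'Michelle Lee'],
    'birth': [1980, 1900, 1950, 1984],
})

# split() with no arguments splits on any run of whitespace and yields lists.
parts = df['name'].str.split()

# Everything between the first and last word; empty string if there is none.
middle = parts.str[1:-1].str.join(' ')

new_df = pd.DataFrame({
    'name': parts.str[0],
    'middle': middle.where(middle != ''),  # NaN marks a missing middle name
    'last': parts.str[-1],
    'birth': df['birth'],
})
print(new_df)
```

Indexing with -1 for the last name means two-word and three-word names go through the same code path, so no per-row length check is needed.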

rearrange name order in pandas column

Background
I have the following df
import pandas as pd
df= pd.DataFrame({'Text' : ['Hi', 'Hello', 'Bye'],
'P_ID': [1,2,3],
'Name' :['Bobby,Bob Lee Brian', 'Tuck,Tom T ', 'Mark, Marky '],
})
Name P_ID Text
0 Bobby,Bob Lee Brian 1 Hi
1 Tuck,Tom T 2 Hello
2 Mark, Marky 3 Bye
Goal
1) rearrange the Name column from e.g. Bobby,Bob Lee Brian to Bob Lee Brian Bobby
2) create new column Rearranged_Name
Desired Output
Name P_ID Text Rearranged_Name
0 Bobby,Bob Lee Brian 1 Hi Bob Lee Brian Bobby
1 Tuck,Tom T 2 Hello Tom T Tuck
2 Mark, Marky 3 Bye Marky Mark
Question
How do I achieve my desired output?
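Use Series.str.replace with capture groups for the values before and after the comma; \s* matches optional whitespace after the comma. Pass regex=True, because pandas 2.0 and later treat the pattern as a literal string by default:
df['Rearranged_Name'] = df['Name'].str.replace(r'(.+),\s*(.+)', r'\2 \1', regex=True)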
print (df)
Text P_ID Name Rearranged_Name
0 Hi 1 Bobby,Bob Lee Brian Bob Lee Brian Bobby
1 Hello 2 Tuck,Tom T Tom T Tuck
2 Bye 3 Mark, Marky Marky Mark
Or use Series.str.split for helper DataFrame and join columns together:
df1 = df['Name'].str.split(r',\s*', expand=True)
df['Rearranged_Name'] = df1[1] + ' ' + df1[0]
