Python Pandas groupby - pandas-groupby

I have a dataframe df as
Name Date Sale
John 2018-3-23 10000
John 2018-3-24 12000
John 2018-3-28 11000
Mary 2018-3-25 15000
Mary 2018-3-29 12000
Mary 2018-3-31 13000
Sam 2018-3-25 18000
Sam 2018-3-26 12000
Sam 2018-3-27 14000
I would like to find the Sale of each person on their last date.
Name Date Sale
John 2018-3-28 11000
Mary 2018-3-31 13000
Sam 2018-3-27 14000
I tried to write the groupby statement as
df.groupby('Name')['Date'].apply(lambda x: x.max())
But it displays only Name and Date, not Sale.
What is the correct command?

Try this:
a = df.groupby('Name')['Date'].apply(lambda x: x.max()).reset_index()
b = a.merge(df, on = list(a), how = 'left')
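An alternative sketch, in case it helps: instead of merging back, you can ask groupby for the index of each person's latest row and select those rows directly. This assumes the Date column is first converted to real datetimes (the question does not say what dtype it has); the name last_rows is just illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Name": ["John", "John", "John", "Mary", "Mary", "Mary", "Sam", "Sam", "Sam"],
    "Date": ["2018-3-23", "2018-3-24", "2018-3-28",
             "2018-3-25", "2018-3-29", "2018-3-31",
             "2018-3-25", "2018-3-26", "2018-3-27"],
    "Sale": [10000, 12000, 11000, 15000, 12000, 13000, 18000, 12000, 14000],
})

# Assumption: Date is stored as text, so parse it before comparing dates
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")

# idxmax returns the index label of the latest Date within each Name group;
# df.loc then pulls the full rows, so Sale comes along
last_rows = df.loc[df.groupby("Name")["Date"].idxmax()]
print(last_rows)
```

Comparing parsed dates rather than strings also avoids surprises when day or month numbers are not zero-padded.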

Related

Closest Combined Match in Excel

I need to come up with a formula that gives me the closest match based on multiple criteria.
This only applies for those names with Valid = "N" (Column I), or else their own name is returned.
For example, Bob has the closest numbers combined to Susan for each day.
I rearranged all of the "valid" names in the table to the right so that Bob does not return "Bob" as the closest match (since we only want to return names that are valid anyways).
I have attached an image of the data below (with column J being my desired results)
Name    | Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Valid (Y/N) | Desired Combined Closest Match | Calculated Closest Match
Bob     | 100    | 25     | 750     | 40        | 750      | 750    | 750      | N           | Susan                          | Susan
Susan   | 100    | 25     | 500     | 25        | 1000     | 1000   | 1000     | Y           | Susan                          | Susan
Karen   | 75     | 30     | 500     | 50        |          |        |          | Y           | Karen                          | Karen
Michele | 100    | 30     | 500     | 50        |          |        |          | Y           | Michele                        | Michele
Tom     | 75     | 30     | 240     | 30        |          |        |          | Y           | Tom                            | Tom
Rob     | 100    | 30     | 1000    | 30        |          |        |          | Y           | Rob                            | Rob
Brian   | 100    | 30     | 1000    | 50        |          |        |          | N           | Rob                            | Susan
Stacy   | 100    | 30     | 500     | 50        |          |        |          | Y           | Stacy                          | Stacy
Rachel  | 100    | 30     | 500     | 50        | 2000     | 2000   | 2000     | Y           | Rachel                         | Rachel
Jeff    | 100    | 30     | 500     | 50        | 1000     | 1000   | 1000     | N           | Susan                          | Susan
I can do this using one column of data, but I am unsure how to combine several columns. I used the INDEX, MATCH, MIN, and ABS functions to find the closest match based on one column (Sunday). Here is my formula for Brian (it is supposed to return "Rob", but it returns "Susan" because I am only using one day of data):
=IF(I8="Y",A8,INDEX($M$2:$M$8,MATCH(MIN(ABS($N$2:$N$8-B8)),ABS($N$2:$N$8-B8),0)))

How to update a dataframe column from a second dataframe by matching on two columns whose values can repeat in the first dataframe?

I have two dataframes with different information about a person. In the first dataframe, a person's name may repeat in different rows. I want to add to or update the first dataframe with data from the second dataframe wherever the two columns identifying the person match in both. Here is an example of what I need to accomplish:
df1:
name surname
0 john doe
1 mary doe
2 peter someone
3 mary doe
4 john another
5 paul another
df2:
name surname account_id
0 peter someone 100
1 john doe 200
2 mary doe 300
3 john another 400
I need to accomplish this:
df1:
name surname account_id
0 john doe 200
1 mary doe 300
2 peter someone 100
3 mary doe 300
4 john another 400
5 paul another <empty>
Thanks!
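One possible approach, sketched under the assumption that each (name, surname) pair appears at most once in df2: a left merge on both columns keeps every row of df1, repeats included, and leaves account_id empty (NaN) where df2 has no match. The name result is illustrative.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "name": ["john", "mary", "peter", "mary", "john", "paul"],
    "surname": ["doe", "doe", "someone", "doe", "another", "another"],
})
df2 = pd.DataFrame({
    "name": ["peter", "john", "mary", "john"],
    "surname": ["someone", "doe", "doe", "another"],
    "account_id": [100, 200, 300, 400],
})

# how="left" preserves df1's rows and their order; unmatched rows get NaN
result = df1.merge(df2, on=["name", "surname"], how="left")
print(result)
```

Because of the NaN for paul another, account_id comes back as a float column. If a pair could appear more than once in df2, the merge would duplicate df1 rows, so df2 would need deduplicating first.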

pandas: groupby + store in another dataframe

I asked a similar question last week and now I have a similar issue, but I cannot convert the answer I received in this case.
Basically, I have a dataframe called comms which looks like this:
articleID Material commentScore
1234 News 0.75
1234 News -0.1
5678 Sport 1.33
5678 News 0.75
5678 Fashion 0.02
7412 Politics -3.45
and another dataframe called arts and it looks like this:
articleID wordCount byLine
1234 1524 John
5678 9824 Mary
7412 3713 Sam
I would like to simply count how many comms there are for each articleID, and store this number in a new column of the arts dataframe named commentNumber.
I think I have to use groupby, count() and maybe merge, but I can't figure out how.
Expected output
articleID wordCount byLine commentNumber
1234 1524 John 2
5678 9824 Mary 3
7412 3713 Sam 1
Thanks in advance!
Andrea
Use groupby() and then count() on one column. Finally, map the result onto the articleID column of arts.
arts['commentNumber'] = arts['articleID'].map(comms.groupby('articleID')['Material'].count())
print(arts)
articleID wordCount byLine commentNumber
0 1234 1524 John 2
1 5678 9824 Mary 3
2 7412 3713 Sam 1
Use Series.map with Series.value_counts:
arts['commentNumber'] = arts['articleID'].map(comms['articleID'].value_counts())
print (arts)
articleID wordCount byLine commentNumber
0 1234 1524 John 2
1 5678 9824 Mary 3
2 7412 3713 Sam 1
Alternative:
from collections import Counter
arts['commentNumber'] = arts['articleID'].map(Counter(comms['articleID']))
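Since the question mentions merge, here is a rough sketch of that route as well. It counts rows per articleID with groupby().size() and left-merges the counts into arts. The fillna(0) step is an assumption about what you want for articles with no comments at all; the sample data has none. The name counts is illustrative.

```python
import pandas as pd

# Sample data copied from the question
comms = pd.DataFrame({
    "articleID": [1234, 1234, 5678, 5678, 5678, 7412],
    "Material": ["News", "News", "Sport", "News", "Fashion", "Politics"],
    "commentScore": [0.75, -0.1, 1.33, 0.75, 0.02, -3.45],
})
arts = pd.DataFrame({
    "articleID": [1234, 5678, 7412],
    "wordCount": [1524, 9824, 3713],
    "byLine": ["John", "Mary", "Sam"],
})

# size() counts rows per group regardless of missing values in other columns
counts = comms.groupby("articleID").size().rename("commentNumber").reset_index()

# Left merge keeps every article; articles without comments would get NaN,
# which is replaced by 0 here (an assumption about the desired result)
arts = arts.merge(counts, on="articleID", how="left")
arts["commentNumber"] = arts["commentNumber"].fillna(0).astype(int)
print(arts)
```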

Separate a name into first and last name using Pandas

I have a DataFrame that looks like this:
name birth
John Henry Smith 1980
Hannah Gonzalez 1900
Michael Thomas Ford 1950
Michelle Lee 1984
And I want to create two new columns, "middle" and "last" for the middle and last names of each person, respectively. People who have no middle name should have None in that data frame.
This would be my ideal result:
name middle last birth
John Henry Smith 1980
Hannah None Gonzalez 1900
Michael Thomas Ford 1950
Michelle None Lee 1984
I have tried different approaches, such as this:
df['middle'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 2 else None)
df['last'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 1 else x.split(" ")[2])
I even made some functions that try to do the same thing more carefully, but I always get the same error: "List Index out of range". This is weird because if I go about printing df.iloc[i,0].split(" ") for i in range(len(df)), I do get lists with length 2 or length 3 only.
I also printed x.count(" ") for all x in the "name" column and I always got either 1 or 2 as a result. There are no single names.
This is my first question so thank you so much!
Use Series.str.split with expand=True.
df2 = (df['name'].str
       .split(' ', expand=True)
       .rename(columns={0: 'name', 1: 'middle', 2: 'last'}))
new_df = df2.assign(middle=df2['middle'].where(df2['last'].notnull()),
                    last=df2['last'].fillna(df2['middle']),
                    birth=df['birth'])
print(new_df)
name middle last birth
0 John Henry Smith 1980
1 Hannah NaN Gonzalez 1900
2 Michael Thomas Ford 1950
3 Michelle NaN Lee 1984
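Another way to sketch the same idea, assuming every name has either two or three space-separated parts as the question states: split each name into a list, take the last element as the last name, and keep the second element as the middle name only when there are three parts. The variable parts is illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "name": ["John Henry Smith", "Hannah Gonzalez", "Michael Thomas Ford", "Michelle Lee"],
    "birth": [1980, 1900, 1950, 1984],
})

# Each entry of parts is a list such as ['John', 'Henry', 'Smith']
parts = df["name"].str.split()

# The last element is the surname, whether there are two or three parts
df["last"] = parts.str[-1]

# The second element is a middle name only when there are exactly three parts;
# where() leaves NaN everywhere else
df["middle"] = parts.str[1].where(parts.str.len() == 3)

# Keep only the first name in the name column
df["name"] = parts.str[0]
print(df)
```

Indexing from the end with .str[-1] avoids having to branch on the number of parts for the last name.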

Appending new elements to a column in pandas dataframe

I have a pandas dataframe like this:
df1:
id name gender
1 Alice Male
2 Jenny Female
3 Bob Male
Now I want to add a new column, sport, whose values are lists. Say I want to add Football to the rows where gender is Male, so that df1 looks like:
df1:
id name gender sport
1 Alice Male [Football]
2 Jenny Female NA
3 Bob Male [Football]
Next, I want to add Badminton to the rows where gender is Female and Tennis to the rows where gender is Male, so that the final output is:
df1:
id name gender sport
1 Alice Male [Football,Tennis]
2 Jenny Female [Badminton]
3 Bob Male [Football,Tennis]
How can I write a general Python function that appends new values to this column based on the value of another column?
The following should work. Initialize the column with empty lists, then append:
import numpy as np

# Start every row with its own empty list
df['sport'] = np.empty((len(df), 0)).tolist()

def append_sport(df, filter_df, sport):
    # list.append returns None, so "or x" hands back the mutated list
    df.loc[filter_df, 'sport'] = df.loc[filter_df, 'sport'].apply(lambda x: x.append(sport) or x)
    return df

filter_df = (df.gender == 'Male')
df = append_sport(df, filter_df, 'Football')
df = append_sport(df, filter_df, 'Cricket')
Output
id name gender sport
0 1 Alice Male [Football, Cricket]
1 2 Jenny Female []
2 3 Bob Male [Football, Cricket]
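For the exact sequence in the question (Football for Male, then Badminton for Female and Tennis for Male), a variant that builds new lists instead of mutating them in place might look like the sketch below. The function name append_where and its parameters are hypothetical, and unmatched rows keep an empty list rather than NA.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Jenny", "Bob"],
    "gender": ["Male", "Female", "Male"],
})

# Give every row its own empty list
df1["sport"] = pd.Series([[] for _ in range(len(df1))], index=df1.index)

def append_where(df, column, value, sport):
    """Append sport to the list in 'sport' wherever df[column] == value (hypothetical helper)."""
    mask = df[column] == value
    # Build a new list for matching rows; other rows keep their current list
    df["sport"] = pd.Series(
        [sports + [sport] if hit else sports for sports, hit in zip(df["sport"], mask)],
        index=df.index,
    )
    return df

df1 = append_where(df1, "gender", "Male", "Football")
df1 = append_where(df1, "gender", "Female", "Badminton")
df1 = append_where(df1, "gender", "Male", "Tennis")
print(df1)
```

Creating a fresh list per matching row avoids accidentally sharing one list object between several rows.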
