Pandas: Better way to combine rows for 'wide' dataset?

I'm trying to make a 'wide' dataset, with one record per game, rather than one record per team, per game. Here's a small example of what I have, first, and then what I'd like to have.
   GAME-ID       TEAM  SCORE
0      123  Cleveland     95
1      123    Orlando    101
2      124   New York    104
3      124    Detroit     98
   GAME-ID      TEAM1    TEAM2  SCORE1  SCORE2
0      123  Cleveland  Orlando      95     101
1      124   New York  Detroit     104      98
I can set a flag for game id count (see below), then later use a for loop to iterate through and set values conditionally, but thought there may be an easier way.
import pandas as pd
dict1 = {'GAME-ID': [123, 123, 124, 124],
         'TEAM': ['Cleveland', 'Orlando', 'New York', 'Detroit'],
         'SCORE': [95, 101, 104, 98]}
df = pd.DataFrame(dict1)
df['GAME_ID_CT'] = df.groupby('GAME-ID').cumcount() + 1
print(df)
Result from code above:
GAME-ID TEAM SCORE GAME_ID_CT
0 123 Cleveland 95 1
1 123 Orlando 101 2
2 124 New York 104 1
3 124 Detroit 98 2
If there's a way to do this by column rather than a bunch of loops, it would be great.

You can try pivot:
new_df = df.pivot(index='GAME-ID',columns='GAME_ID_CT')
# rename
new_df.columns = [f'{a}{b}' for a,b in new_df.columns]
Output:
             TEAM1    TEAM2  SCORE1  SCORE2
GAME-ID
123      Cleveland  Orlando      95     101
124       New York  Detroit     104      98
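For reference, here is a self-contained sketch of the pivot approach that rebuilds the sample frame from the question. The names wide, name and n are only illustrative, and the final reset_index is an optional extra to turn GAME-ID back into a regular column.

```python
import pandas as pd

df = pd.DataFrame({
    'GAME-ID': [123, 123, 124, 124],
    'TEAM': ['Cleveland', 'Orlando', 'New York', 'Detroit'],
    'SCORE': [95, 101, 104, 98],
})

# number the rows within each game: 1 for the first team, 2 for the second
df['GAME_ID_CT'] = df.groupby('GAME-ID').cumcount() + 1

# pivot every remaining column on the counter; the columns become a MultiIndex
wide = df.pivot(index='GAME-ID', columns='GAME_ID_CT')

# flatten ('TEAM', 1) -> 'TEAM1', ('SCORE', 2) -> 'SCORE2', and so on
wide.columns = [f'{name}{n}' for name, n in wide.columns]

# optional: make GAME-ID an ordinary column again
wide = wide.reset_index()
print(wide)
```

This assumes each game has the same number of rows; a game with a single row would leave NaN in the TEAM2 and SCORE2 columns.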

I think this actually worked best for me. It's simple and accommodates lots more variables.
df1 = df[df['GAME_ID_CT'] == 1]
df2 = df[df['GAME_ID_CT'] == 2]
new_df = pd.merge(df1, df2, on='GAME-ID', suffixes=['1', '2'])
print(new_df)
GAME-ID TEAM1 SCORE1 GAME_ID_CT1 TEAM2 SCORE2 GAME_ID_CT2
0 123 Cleveland 95 1 Orlando 101 2
1 124 New York 104 1 Detroit 98 2

Related

difference between two column of a dataframe

I am new to Python and would like to find the difference between two columns of a dataframe.
What I want is the difference between two columns together with a corresponding third column. For example, I have a dataframe Soccer which lists all the teams playing soccer, with the goals for and against each club. I want to find the goal difference along with the team name, i.e. (Goals Diff = GoalsFor - GoalsAgainst).
Pos Team Seasons Points GamesPlayed GamesWon GamesDrawn \
0 1 Real Madrid 86 5656 2600 1647 552
1 2 Barcelona 86 5435 2500 1581 573
2 3 Atletico Madrid 80 5111 2614 1241 598
GamesLost GoalsFor GoalsAgainst
0 563 5947 3140
1 608 5900 3114
2 775 4534 3309
I tried creating a function and then iterating through each row of a dataframe as below:
def Goal_diff_count_Formal(gFor, gAgainst, team):
    # return the team name together with its goal difference
    goalsDifference = gFor - gAgainst
    return [team, goalsDifference]
total = []
for index, row in football.iterrows():
    goalsFor = row['GoalsFor']
    goalsAgainst = row['GoalsAgainst']
    teamName = row['Team']
    total.append(Goal_diff_count_Formal(int(goalsFor), int(goalsAgainst), teamName))
However, I would like to know if there is a quicker way to get this, something like
dataframe['GoalsFor'] - dataframe['GoalsAgainst'] #along with the team name in the dataframe
If the values in the Team column are unique, set Team as the index, compute the difference, and select a team by its index label:
df = df.set_index('Team')
s = df['GoalsFor'] - df['GoalsAgainst']
print (s)
Team
Real Madrid 2807
Barcelona 2786
Atletico Madrid 1225
dtype: int64
print (s['Atletico Madrid'])
1225
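If you would rather keep Team as an ordinary column, a small sketch along these lines should also work. It uses only the three relevant columns from the sample data, and GoalDiff is a column name I made up for the example.

```python
import pandas as pd

football = pd.DataFrame({
    'Team': ['Real Madrid', 'Barcelona', 'Atletico Madrid'],
    'GoalsFor': [5947, 5900, 4534],
    'GoalsAgainst': [3140, 3114, 3309],
})

# vectorized subtraction, no row-by-row loop needed
football['GoalDiff'] = football['GoalsFor'] - football['GoalsAgainst']

# keep the team name next to its goal difference
result = football[['Team', 'GoalDiff']]
print(result)
```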
Solution if the Team column may contain duplicated values:
I believe you need to group by Team and aggregate with sum first, and then get the difference:
#change sample data for Team in row 3
print (df)
Pos Team Seasons Points GamesPlayed GamesWon GamesDrawn \
0 1 Real Madrid 86 5656 2600 1647 552
1 2 Barcelona 86 5435 2500 1581 573
2 3 Real Madrid 80 5111 2614 1241 598
GamesLost GoalsFor GoalsAgainst
0 563 5947 3140
1 608 5900 3114
2 775 4534 3309
# select several columns with a list (double brackets); the tuple form was removed in pandas 1.0
df = df.groupby('Team')[['GoalsFor','GoalsAgainst']].sum()
df['diff'] = df['GoalsFor'] - df['GoalsAgainst']
print (df)
GoalsFor GoalsAgainst diff
Team
Barcelona 5900 3114 2786
Real Madrid 10481 6449 4032
EDIT:
s = df['GoalsFor'] - df['GoalsAgainst']
print (s)
Team
Barcelona 2786
Real Madrid 4032
dtype: int64
print (s['Barcelona'])
2786
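A self-contained sketch of the duplicated-team case, reduced to the columns that matter. The name totals is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Real Madrid', 'Barcelona', 'Real Madrid'],
    'GoalsFor': [5947, 5900, 4534],
    'GoalsAgainst': [3140, 3114, 3309],
})

# sum both goal columns per team, then subtract the aggregated columns
totals = df.groupby('Team')[['GoalsFor', 'GoalsAgainst']].sum()
totals['diff'] = totals['GoalsFor'] - totals['GoalsAgainst']
print(totals)

# look up a single team by its index label
print(totals.loc['Real Madrid', 'diff'])
```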

Perform computation on a value in one row and update another row's column with that value

I have a dataframe that looks somewhat like :
Categor_1 Categor_2 Numeric_1 Numeric_2 Numeric_3 Numeric_col4 Month
ABC XYZ 3523 454 4354 565 2018-02
ABC XYZ 333 444 123 565 2018-03
qww ggg 3222 568 123 483976 2018-03
I would like to apply some simple math on a column with a condition and assign it to a different row.
For instance
if Month == 2018-03 & Categor_2 == 'XYZ', perform Numeric_3*2 and assign it to Numeric_3 under month 2018-02.
So the output would be something like :
Categor_1 Categor_2 Numeric_1 Numeric_2 Numeric_3_ Adj Numeric_col4 Month
ABC XYZ 3523 454 246 565 2018-02
ABC XYZ 333 444 123 565 2018-03
qww ggg 3222 568 123 483976 2018-03
I was thinking of taking out the necessary columns, then doing a pivot, applying the math, then reshaping it back to the original layout.
However, if there is a quicker way, I would be grateful to know.
It depends on the length of the filtered Series. Here the filter returns a one-element Series, so it can be converted to a scalar with next and iter, which also lets you supply a default value if no row matches the condition:
mask = (df.Month == '2018-03') & (df.Categor_2 == 'XYZ')
print (df.loc[mask, 'Numeric_3'] * 2)
1 246
Name: Numeric_3, dtype: int64
#get first value of Series; if an empty Series is returned, use 0
a = next(iter(df.loc[mask, 'Numeric_3'] * 2), 0)
print (a)
246
df.loc[df.Month == '2018-02', 'Numeric_3'] = a
print (df)
Categor_1 Categor_2 Numeric_1 Numeric_2 Numeric_3 Numeric_col4 Month
0 ABC XYZ 3523 454 246 565 2018-02
1 ABC XYZ 333 444 123 565 2018-03
2 qww ggg 3222 568 123 483976 2018-03
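A runnable sketch of the same idea, using the factor of 2 from the question and only the columns involved. Restricting the target rows to Categor_2 == 'XYZ' as well is my own assumption about what is wanted; the names source, target and value are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Categor_1': ['ABC', 'ABC', 'qww'],
    'Categor_2': ['XYZ', 'XYZ', 'ggg'],
    'Numeric_3': [4354, 123, 123],
    'Month': ['2018-02', '2018-03', '2018-03'],
})

# row(s) to read from and row(s) to write to
source = (df['Month'] == '2018-03') & (df['Categor_2'] == 'XYZ')
target = (df['Month'] == '2018-02') & (df['Categor_2'] == 'XYZ')

# first matching value times 2, or None when nothing matches
value = next(iter(df.loc[source, 'Numeric_3'] * 2), None)

# only overwrite when a source row was actually found
if value is not None:
    df.loc[target, 'Numeric_3'] = value
print(df)
```

Using None as the default, instead of 0, avoids silently writing a zero when the source row is missing.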

How to extract a keyword(string) from a column in pandas dataframe in python

I have a dataframe df and it looks like this:
id Type agent_id created_at
0 44525 Stunning 6 bedroom villa in New Delhi 184 2018-03-09
1 44859 Villa for sale in Amritsar 182 2017-02-19
2 45465 House in Faridabad 154 2017-04-17
3 50685 5 Hectre land near New Delhi 113 2017-09-01
4 130728 Duplex in Mumbai 157 2017-02-07
5 130856 Large plot with fantastic views in Mumbai 137 2018-01-16
6 130857 Modern Design Penthouse in Bangalore 199 2017-03-24
I have this tabular data and I'm trying to clean it by extracting keywords from the Type column and creating a new dataframe with new columns.
Apartment = ['apartment', 'penthouse', 'duplex']
House = ['house', 'villa', 'country estate']
Plot = ['plot', 'land']
Location = ['New Delhi','Mumbai','Bangalore','Amritsar']
So the desired dataframe should look like this:
id Type Location agent_id created_at
0 44525 House New Delhi 184 2018-03-09
1 44859 House Amritsar 182 2017-02-19
2 45465 House Faridabad 154 2017-04-17
3 50685 Plot New Delhi 113 2017-09-01
4 130728 Apartment Mumbai 157 2017-02-07
5 130856 Plot Mumbai 137 2018-01-16
6 130857 Apartment Bangalore 199 2017-03-24
So far I've tried this:
import pandas as pd
df = pd.read_csv('test_data.csv')
# I can extract these keywords one by one using for loops, but how
# can I do this in pandas with as few lines of code as possible?
for index, values in df['Type'].items():
    for i in Apartment:
        # compare in lower case, since the keyword lists are lower case
        if i in values.lower():
            print(i)
df_new = pd.DataFrame(df['id'])
Can someone tell me how to solve this?
First create the Location column with str.extract, joining the location names with | (regex OR):
pat = '|'.join(r"\b{}\b".format(x) for x in Location)
df['Location'] = df['Type'].str.extract('('+ pat + ')', expand=False)
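To see the extraction step in isolation, here is a small sketch on a few of the sample titles. Wrapping each name in re.escape is an extra precaution I added in case a location ever contains regex metacharacters; titles and locations are illustrative names.

```python
import re
import pandas as pd

Location = ['New Delhi', 'Mumbai', 'Bangalore', 'Amritsar']
titles = pd.Series([
    'Stunning 6 bedroom villa in New Delhi',
    'House in Faridabad',
    'Duplex in Mumbai',
])

# one alternation of whole-word matches: \bNew Delhi\b|\bMumbai\b|...
pat = '|'.join(r'\b{}\b'.format(re.escape(x)) for x in Location)

# a single capture group with expand=False returns a Series
locations = titles.str.extract('(' + pat + ')', expand=False)
print(locations)
```

Titles whose city is not in the Location list (Faridabad here) come back as missing values, which matches the NaN in the answer's output.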
Then create a dictionary from the other lists, swap the keys with the values, and in a loop set the value through a mask built with str.contains and the parameter case=False:
d = {'Apartment' : Apartment,
'House' : House,
'Plot' : Plot}
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
for k, v in d1.items():
df.loc[df['Type'].str.contains(k, case=False), 'Type'] = v
print (df)
id Type agent_id created_at Location
0 44525 House 184 2018-03-09 New Delhi
1 44859 House 182 2017-02-19 Amritsar
2 45465 House 154 2017-04-17 NaN
3 50685 Plot 113 2017-09-01 New Delhi
4 130728 Apartment 157 2017-02-07 Mumbai
5 130856 Plot 137 2018-01-16 Mumbai
6 130857 Apartment 199 2017-03-24 Bangalore
106 if isna(key).any():
--> 107 raise ValueError('cannot index with vector containing '
108 'NA / NaN values')
109 return False
ValueError: cannot index with vector containing NA / NaN values
I got the above error.
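That error typically means the Type column contains missing values, so str.contains returns NaN for those rows and the mask is no longer purely boolean. One likely fix, sketched here on made-up data with a deliberately missing Type, is to pass na=False so missing rows count as non-matches.

```python
import numpy as np
import pandas as pd

# toy frame with one missing Type, to reproduce the situation
df = pd.DataFrame({'Type': ['Villa for sale in Amritsar', np.nan, 'Duplex in Mumbai']})

# keyword -> category, same shape as d1 in the answer
d1 = {'villa': 'House', 'duplex': 'Apartment'}

for keyword, label in d1.items():
    # na=False turns missing values into False instead of NaN
    mask = df['Type'].str.contains(keyword, case=False, na=False)
    df.loc[mask, 'Type'] = label
print(df)
```

Rows with a missing Type are simply left untouched.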

Count values within a range in one column and group them by another column using Python pandas

I have the following input
Input:
Bus Fare Startcity
56 98 sathy
95 85 sathy
98 95 chennai
85 92 chennai
56 75 chennai
56 83 chennai
I have to count the rows where fare >= 90 and fare <= 98, grouped by "Startcity"
Output 1:
Fare Startcity
1 Sathy
2 Chennai
I also need to calculate the average where fare >= 90 and fare <= 98, grouped by "Startcity"
Output 2:
Fare Startcity
98 Sathy
93.5 Chennai
If you want to count the number of rows matching a condition per group, create a boolean mask with ge (>=) and count the True values with sum:
df1 = df['Fare'].ge(90).groupby(df['Startcity']).sum().astype(int).reset_index()
print (df1)
Startcity Fare
0 chennai 2
1 sathy 1
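The count above only applies the lower bound. A sketch that applies both bounds from the question uses between, which is inclusive on both ends by default. The output column name Count is my own choice.

```python
import pandas as pd

df = pd.DataFrame({
    'Bus': [56, 95, 98, 85, 56, 56],
    'Fare': [98, 85, 95, 92, 75, 83],
    'Startcity': ['sathy', 'sathy', 'chennai', 'chennai', 'chennai', 'chennai'],
})

# True where 90 <= Fare <= 98
in_range = df['Fare'].between(90, 98)

# summing booleans per city counts the True values
counts = (in_range.groupby(df['Startcity'])
                  .sum()
                  .astype(int)
                  .reset_index(name='Count'))
print(counts)
```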
If you want to filter with between first and then aggregate, use:
df2 = df[df['Fare'].between(90, 98)].groupby('Startcity')['Fare'].mean().reset_index()
print (df2)
Startcity Fare
0 chennai 93.5
1 sathy 98.0
Or, if you also need a row for groups without any match (their mean is NaN; chain .fillna(0) if you want 0 instead):
df3=df.groupby('Startcity')['Fare'].apply(lambda x: x[x.between(90, 98)].mean()).reset_index()
print (df3)
Startcity Fare
0 chennai 93.5
1 sathy 98.0
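To make the difference between the two variants visible, here is a sketch with one extra made-up city (salem) that has no fare in range. Instead of apply with a lambda it masks the out-of-range fares with where and then takes the group mean, which I believe gives the same result; avg is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'Fare': [98, 85, 95, 92, 75, 83, 50],
    'Startcity': ['sathy', 'sathy', 'chennai', 'chennai',
                  'chennai', 'chennai', 'salem'],  # salem is a made-up extra row
})

# keep fares in [90, 98], replace everything else with NaN
masked = df['Fare'].where(df['Fare'].between(90, 98))

# mean skips NaN; a group with only NaN stays NaN
avg = masked.groupby(df['Startcity']).mean()
print(avg)

# optional: show 0 instead of NaN for cities without a match
print(avg.fillna(0))
```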

Grouping and Multiindexing a pandas dataframe

Suppose I have a dataframe as follows
In [6]: df.head()
Out[6]:
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
4 Dragoons 1st Cooze 3 70
I have a dictionary as follows:
army = {'Majors' : 'Nighthawks', 'Captains' : 'Dragoons'}
and I want the dataframe to have a multi-index of the form ["army","company"] only.
How will I proceed?
If I understand correctly:
You can use map to find values in a dictionary (using dictionary comprehension to swap key/value pairs since they are backwards):
army = {'Majors': 'Nighthawks', 'Captains': 'Dragoons'}
df.assign(army=df.regiment.map({k:v for v, k in army.items()})).set_index(['army', 'company'], drop=True)
regiment name preTestScore postTestScore
army company
Majors 1st Nighthawks Miller 4 25
1st Nighthawks Jacobson 24 94
2nd Nighthawks Ali 31 57
2nd Nighthawks Milner 2 62
Captains 1st Dragoons Cooze 3 70
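The same steps as a self-contained sketch, with the inverted dictionary given its own name (regiment_to_army, my choice) so the key/value swap is easier to follow.

```python
import pandas as pd

df = pd.DataFrame({
    'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons'],
    'company': ['1st', '1st', '2nd', '2nd', '1st'],
    'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
    'preTestScore': [4, 24, 31, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70],
})
army = {'Majors': 'Nighthawks', 'Captains': 'Dragoons'}

# invert the mapping so it goes regiment -> army label
regiment_to_army = {regiment: rank for rank, regiment in army.items()}

# add the army column, then move army and company into the index
result = (df.assign(army=df['regiment'].map(regiment_to_army))
            .set_index(['army', 'company']))
print(result)
```

Regiments missing from the dictionary would get NaN as their army label.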
