Perform computation on a value in one row and update another row's column with that value - python-3.x

I have a dataframe that looks somewhat like this:
Categor_1 Categor_2 Numeric_1 Numeric_2 Numeric_3 Numeric_col4 Month
ABC XYZ 3523 454 4354 565 2018-02
ABC XYZ 333 444 123 565 2018-03
qww ggg 3222 568 123 483976 2018-03
I would like to apply some simple math on a column with a condition and assign it to a different row.
For instance
if Month == 2018-03 & Categor_2 == 'XYZ', perform Numeric_3*2 and assign it to Numeric_3 under month 2018-02.
So the output would be something like :
Categor_1 Categor_2 Numeric_1 Numeric_2 Numeric_3_ Adj Numeric_col4 Month
ABC XYZ 3523 454 246 565 2018-02
ABC XYZ 333 444 123 565 2018-03
qww ggg 3222 568 123 483976 2018-03
I was thinking of taking out the necessary columns, pivoting, applying the math, and then reshaping the result back to the original layout.
However, if there is a quicker way, I would be grateful to know.

It depends on the length of the Series you get after filtering the DataFrame. Here it has a single element, so you can reduce it to a scalar with next and iter, which also lets you supply a default value if the condition matches no rows:
mask = (df.Month == '2018-03') & (df.Categor_2 == 'XYZ')
print (df.loc[mask, 'Numeric_3'] * 2)
1 246
Name: Numeric_3, dtype: int64
#get first value of Series; if the Series is empty, 0 is returned
a = next(iter(df.loc[mask, 'Numeric_3'] * 2), 0)
print (a)
246
df.loc[df.Month == '2018-02', 'Numeric_3'] = a
print (df)
Categor_1 Categor_2 Numeric_1 Numeric_2 Numeric_3 Numeric_col4 Month
0 ABC XYZ 3523 454 246 565 2018-02
1 ABC XYZ 333 444 123 565 2018-03
2 qww ggg 3222 568 123 483976 2018-03
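The same idea as a self-contained sketch, using the multiplier 2 from the question. The DataFrame only rebuilds the relevant sample columns, and the names source_mask, target_mask and new_value are illustrative. Restricting the target rows to Categor_2 == 'XYZ' is an assumption about the intent.

```python
import pandas as pd

# Sample rows from the question (only the columns needed here)
df = pd.DataFrame({
    "Categor_1": ["ABC", "ABC", "qww"],
    "Categor_2": ["XYZ", "XYZ", "ggg"],
    "Numeric_3": [4354, 123, 123],
    "Month": ["2018-02", "2018-03", "2018-03"],
})

# Rows to read from and rows to write to (assumed to share Categor_2)
source_mask = (df["Month"] == "2018-03") & (df["Categor_2"] == "XYZ")
target_mask = (df["Month"] == "2018-02") & (df["Categor_2"] == "XYZ")

# First matching value times 2, or None when nothing matches
new_value = next(iter(df.loc[source_mask, "Numeric_3"] * 2), None)

# Only overwrite the target rows when a source value was found
if new_value is not None:
    df.loc[target_mask, "Numeric_3"] = new_value

print(df)
```

Using None as the default, rather than 0, leaves the target rows unchanged when the condition matches nothing.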

Related

Create two new Dataframes from existing one based on unique and repeated values of a column

colA colB
A 125
B 546
C 4586
D 547
A 869
B 789
A 258
E 123
I want to create two new dataframes. The first should contain the first occurrence of each unique value in 'colA', and the second should contain the rows where a 'colA' value is repeated. colB itself has no repeated values. The first output is like this:
colA colB
A 125
B 546
C 4586
D 547
E 123
The second output is like this:
colA colB
A 869
B 789
A 258
For the first group, use drop_duplicates. For second group, use duplicated:
print (df.drop_duplicates("colA"))
colA colB
0 A 125
1 B 546
2 C 4586
3 D 547
7 E 123
print (df[df.duplicated("colA")])
colA colB
4 A 869
5 B 789
6 A 258
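Both frames can also come from a single boolean mask, since drop_duplicates keeps exactly the rows that duplicated marks as False. A minimal sketch with the sample data (the names is_repeat, first_seen and repeats are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "colA": list("ABCDABAE"),
    "colB": [125, 546, 4586, 547, 869, 789, 258, 123],
})

# True for the second and later occurrences of each colA value
is_repeat = df.duplicated("colA")

first_seen = df[~is_repeat]  # same rows as df.drop_duplicates("colA")
repeats = df[is_repeat]

print(first_seen)
print(repeats)
```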

Converting each row of a dataframe into a string and assigning it as a column to another dataset in pandas

I have a dataset:
id name address phone email
123 abc 123 abc 12345 info#abc.com
456 cbs 456 cbs 67890 info#cbs.com
758 nbc 789 nbc 11121 info#nbc.com
I want to create a new dataset, where it retains the first two columns (id and name) and has the third column, which will have a string that is a combination of values of address, phone and email. In other words, I need it to look like this:
id name meta_str
123 abc '123 abc 12345 info#abc.com'
456 cbs '456 cbs 67890 info#cbs.com'
758 nbc '789 nbc 11121 info#nbc.com'
This is the code I have:
df_transformed = df[['id','name']]
df_meta = df[['address','phone','email']]
df_meta_str = df_meta.iloc[:].to_string(header=False, index=False)
df_transformed['meta_str'] = df_meta_str
But what I get is:
id name meta_str
123 abc '123 abc 12345 info#abc.com'
456 cbs '123 abc 12345 info#abc.com'
758 nbc '123 abc 12345 info#abc.com'
I think the problem is that df_meta_str has the data in all rows combined as one big string.
What would be a way to achieve a separate string on a separate row?
I would do:
df['meta_str']=df.loc[:,'address':].astype(str).agg(' '.join,1)
0 123abc 12345 info#abc.com
1 456cbs 67890 info#cbs.com
2 789nbc 11121 info#nbc.com
dtype: object
You can use simple str concatenation:
df['meta_str'] = df.address + ' ' + df.phone.astype(str) + ' ' + df.email
df.drop(columns=['address','phone','email'], inplace=True)
Output:
id name meta_str
123 abc 123 abc 12345 info#abc.com
456 cbs 456 cbs 67890 info#cbs.com
758 nbc 789 nbc 11121 info#nbc.com
Or use the df.apply method:
df['meta_str'] = df[['address','phone','email']].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
You can use pd.Series.str.cat here.
df['meta_str'] = df['address'].str.cat(df[['phone','email']].astype(str),sep=' ')
df.drop(columns=['address','phone','email'])
id name meta_str
0 123 abc 123 abc 12345 info#abc.com
1 456 cbs 456 cbs 67890 info#cbs.com
2 758 nbc 789 nbc 11121 info#nbc.com
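For reference, a self-contained sketch of the row-wise join that leaves the original frame untouched. The names meta_cols and result are illustrative, and the sample values are copied from the question as shown.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [123, 456, 758],
    "name": ["abc", "cbs", "nbc"],
    "address": ["123 abc", "456 cbs", "789 nbc"],
    "phone": [12345, 67890, 11121],
    "email": ["info#abc.com", "info#cbs.com", "info#nbc.com"],
})

meta_cols = ["address", "phone", "email"]

# Copy the kept columns so the new column is not added to a slice of df
result = df[["id", "name"]].copy()

# Cast every value to str, then join each row's values with a space
result["meta_str"] = df[meta_cols].astype(str).agg(" ".join, axis=1)

print(result)
```

The .copy() call is there because assigning a column to df[['id','name']] directly, as in the question, typically triggers a SettingWithCopyWarning.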

difference between two column of a dataframe

I am new to Python and would like to find the difference between two columns of a dataframe.
What I want is the difference between two columns alongside a third column. For example, I have a dataframe Soccer which lists the teams playing soccer, with the goals for and against each club. I want to find the goal difference along with the team name, i.e. (Goals Diff=goalsFor-goalsAgainst).
Pos Team Seasons Points GamesPlayed GamesWon GamesDrawn \
0 1 Real Madrid 86 5656 2600 1647 552
1 2 Barcelona 86 5435 2500 1581 573
2 3 Atletico Madrid 80 5111 2614 1241 598
GamesLost GoalsFor GoalsAgainst
0 563 5947 3140
1 608 5900 3114
2 775 4534 3309
I tried creating a function and then iterating through each row of a dataframe as below:
# the enclosing function's def line was not shown; the name below is a placeholder
def goal_differences(football):
    total = []
    for index, row in football.iterrows():
        goalsFor = row['GoalsFor']
        goalsAgainst = row['GoalsAgainst']
        teamName = row['Team']
        total.append(Goal_diff_count_Formal(int(goalsFor), int(goalsAgainst), teamName))
    return total

def Goal_diff_count_Formal(gFor, gAgainst, team):
    goalsDifference = gFor - gAgainst
    return [team, goalsDifference]
However, I would like to know if there is a quicker way to get this, something like
dataframe['GoalsFor'] - dataframe['GoalsAgainst'] #along with the team name in the dataframe
If the values in the Team column are unique, set Team as the index, compute the difference, and then select a team by its index label:
df = df.set_index('Team')
s = df['GoalsFor'] - df['GoalsAgainst']
print (s)
Team
Real Madrid 2807
Barcelona 2786
Atletico Madrid 1225
dtype: int64
print (s['Atletico Madrid'])
1225
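If you would rather keep Team as an ordinary column, a minimal sketch is to store the difference in a new column and select both. The column name GoalDiff and the variable summary are illustrative, and only the columns needed here are rebuilt.

```python
import pandas as pd

football = pd.DataFrame({
    "Team": ["Real Madrid", "Barcelona", "Atletico Madrid"],
    "GoalsFor": [5947, 5900, 4534],
    "GoalsAgainst": [3140, 3114, 3309],
})

# Vectorized subtraction: one result per row, aligned on the row index
football["GoalDiff"] = football["GoalsFor"] - football["GoalsAgainst"]

# Team name next to its goal difference
summary = football[["Team", "GoalDiff"]]
print(summary)
```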
If the Team column can contain duplicated values:
I believe you need to group by Team and aggregate with sum first, then take the difference:
#change sample data for Team in row 3
print (df)
Pos Team Seasons Points GamesPlayed GamesWon GamesDrawn \
0 1 Real Madrid 86 5656 2600 1647 552
1 2 Barcelona 86 5435 2500 1581 573
2 3 Real Madrid 80 5111 2614 1241 598
GamesLost GoalsFor GoalsAgainst
0 563 5947 3140
1 608 5900 3114
2 775 4534 3309
df = df.groupby('Team')[['GoalsFor','GoalsAgainst']].sum()
df['diff'] = df['GoalsFor'] - df['GoalsAgainst']
print (df)
GoalsFor GoalsAgainst diff
Team
Barcelona 5900 3114 2786
Real Madrid 10481 6449 4032
EDIT:
s = df['GoalsFor'] - df['GoalsAgainst']
print (s)
Team
Barcelona 2786
Real Madrid 4032
dtype: int64
print (s['Barcelona'])
2786

Pandas: Better way to combine rows for 'wide' dataset?

I'm trying to make a 'wide' dataset, with one record per game, rather than one record per team, per game. Here's a small example of what I have, first, and then what I'd like to have.
GAME-ID TEAM SCORE
0 123 Cleveland 95
1 123 Orlando 101
2 124 New York 104
3 124 Detroit 98
GAME-ID TEAM1 TEAM2 SCORE1 SCORE2
0 123 Cleveland Orlando 95 101
1 124 New York Detroit 104 98
I can set a flag for game id count (see below), then later use a for loop to iterate through and set values conditionally, but thought there may be an easier way.
import pandas as pd
dict1 = {'GAME-ID':[123, 123, 124, 124],
'TEAM':['Cleveland', 'Orlando', 'New York', 'Detroit'],
'SCORE':[95, 101, 104, 98]}
df = pd.DataFrame(dict1)
df['GAME_ID_CT'] = df.groupby('GAME-ID').cumcount() + 1
print(df)
Result from code above:
GAME-ID TEAM SCORE GAME_ID_CT
0 123 Cleveland 95 1
1 123 Orlando 101 2
2 124 New York 104 1
3 124 Detroit 98 2
If there's a way to do this by column rather than a bunch of loops, it would be great.
You can try pivot:
new_df = df.pivot(index='GAME-ID',columns='GAME_ID_CT')
# rename
new_df.columns = [f'{a}{b}' for a,b in new_df.columns]
Output:
TEAM1 TEAM2 SCORE1 SCORE2
GAME-ID
123 Cleveland Orlando 95 101
124 New York Detroit 104 98
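A self-contained sketch of the pivot approach that also turns GAME-ID back into a regular column, as in the desired output. The helper column name n and the variable wide are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "GAME-ID": [123, 123, 124, 124],
    "TEAM": ["Cleveland", "Orlando", "New York", "Detroit"],
    "SCORE": [95, 101, 104, 98],
})

# Number the rows within each game: 1 for the first team, 2 for the second
df["n"] = df.groupby("GAME-ID").cumcount() + 1

# One row per game; columns become (original name, n) pairs
wide = df.pivot(index="GAME-ID", columns="n")

# Flatten the two-level column labels, e.g. ("TEAM", 1) -> "TEAM1"
wide.columns = [f"{name}{n}" for name, n in wide.columns]

# Move GAME-ID from the index back to a column
wide = wide.reset_index()
print(wide)
```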
I think this actually worked best for me. It's simple and accommodates lots more variables.
df1 = df[df['GAME_ID_CT'] == 1]
df2 = df[df['GAME_ID_CT'] == 2]
new_df = pd.merge(df1, df2, on='GAME-ID', suffixes=['1', '2'])
print(new_df)
GAME-ID TEAM1 SCORE1 GAME_ID_CT1 TEAM2 SCORE2 GAME_ID_CT2
0 123 Cleveland 95 1 Orlando 101 2
1 124 New York 104 1 Detroit 98 2

How to extract a keyword(string) from a column in pandas dataframe in python

I have a dataframe df and it looks like this:
id Type agent_id created_at
0 44525 Stunning 6 bedroom villa in New Delhi 184 2018-03-09
1 44859 Villa for sale in Amritsar 182 2017-02-19
2 45465 House in Faridabad 154 2017-04-17
3 50685 5 Hectre land near New Delhi 113 2017-09-01
4 130728 Duplex in Mumbai 157 2017-02-07
5 130856 Large plot with fantastic views in Mumbai 137 2018-01-16
6 130857 Modern Design Penthouse in Bangalore 199 2017-03-24
I have this tabular data and I'm trying to clean it by extracting keywords from the Type column to create a new dataframe with new columns.
Apartment = ['apartment', 'penthouse', 'duplex']
House = ['house', 'villa', 'country estate']
Plot = ['plot', 'land']
Location = ['New Delhi','Mumbai','Bangalore','Amritsar']
So the desired dataframe should look like this:
id Type Location agent_id created_at
0 44525 House New Delhi 184 2018-03-09
1 44859 House Amritsar 182 2017-02-19
2 45465 House Faridabad 154 2017-04-17
3 50685 Plot New Delhi 113 2017-09-01
4 130728 Apartment Mumbai 157 2017-02-07
5 130856 Plot Mumbai 137 2018-01-16
6 130857 Apartment Bangalore 199 2017-03-24
So far I've tried this:
import pandas as pd
df = pd.read_csv('test_data.csv')
# I can extract these keywords one by one using for loops, but how
# can I do this in pandas with the fewest possible lines of code?
for index, values in df['Type'].items():
    for i in Apartment:
        if i in values:
            print(i)
df_new = pd.DataFrame(df['id'])
Can someone tell me how to solve this?
First create the Location column with str.extract, joining the names with | (regex OR):
pat = '|'.join(r"\b{}\b".format(x) for x in Location)
df['Location'] = df['Type'].str.extract('('+ pat + ')', expand=False)
Then build a dictionary from the other lists, invert it so that each keyword maps to its category, and in a loop set the category wherever str.contains (with case=False) finds the keyword:
d = {'Apartment' : Apartment,
'House' : House,
'Plot' : Plot}
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
for k, v in d1.items():
df.loc[df['Type'].str.contains(k, case=False), 'Type'] = v
print (df)
id Type agent_id created_at Location
0 44525 House 184 2018-03-09 New Delhi
1 44859 House 182 2017-02-19 Amritsar
2 45465 House 154 2017-04-17 NaN
3 50685 Plot 113 2017-09-01 New Delhi
4 130728 Apartment 157 2017-02-07 Mumbai
5 130856 Plot 137 2018-01-16 Mumbai
6 130857 Apartment 199 2017-03-24 Bangalore
106 if isna(key).any():
--> 107 raise ValueError('cannot index with vector containing '
108 'NA / NaN values')
109 return False
ValueError: cannot index with vector containing NA / NaN values
I got the error above.
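One likely cause, assuming the Type column contains missing values: str.contains then returns NaN for those rows, and a mask with NaN cannot be used for indexing. Passing na=False treats missing text as "no match". A minimal sketch on made-up rows (the Category column and the variable names are illustrative, and one missing Type value is added on purpose):

```python
import re
import pandas as pd

df = pd.DataFrame({"Type": [
    "Stunning 6 bedroom villa in New Delhi",
    "Duplex in Mumbai",
    None,  # missing value, added to reproduce the problem
    "Large plot with fantastic views in Mumbai",
]})

keywords = {
    "Apartment": ["apartment", "penthouse", "duplex"],
    "House": ["house", "villa", "country estate"],
    "Plot": ["plot", "land"],
}
locations = ["New Delhi", "Mumbai", "Bangalore", "Amritsar"]

# Location: first listed place name found as a whole word
loc_pat = "|".join(rf"\b{re.escape(name)}\b" for name in locations)
df["Location"] = df["Type"].str.extract(f"({loc_pat})", expand=False)

# Category: written to a new column so Type keeps its original text
df["Category"] = None
for category, words in keywords.items():
    word_pat = "|".join(re.escape(w) for w in words)
    # na=False turns missing Type values into False instead of NaN
    hit = df["Type"].str.contains(word_pat, case=False, na=False)
    df.loc[hit, "Category"] = category

print(df)
```

Rows with a missing Type end up with no Location and no Category rather than raising an error.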
