How to extract a keyword (string) from a column in a pandas dataframe - python-3.x

I have a dataframe df and it looks like this:
   id      Type                                       agent_id  created_at
0  44525   Stunning 6 bedroom villa in New Delhi      184       2018-03-09
1  44859   Villa for sale in Amritsar                 182       2017-02-19
2  45465   House in Faridabad                         154       2017-04-17
3  50685   5 Hectre land near New Delhi               113       2017-09-01
4  130728  Duplex in Mumbai                           157       2017-02-07
5  130856  Large plot with fantastic views in Mumbai  137       2018-01-16
6  130857  Modern Design Penthouse in Bangalore       199       2017-03-24
I have this tabular data and I am trying to clean it by extracting keywords from the Type column, then building a new dataframe with the extracted values as new columns.
Apartment = ['apartment', 'penthouse', 'duplex']
House = ['house', 'villa', 'country estate']
Plot = ['plot', 'land']
Location = ['New Delhi','Mumbai','Bangalore','Amritsar']
So the desired dataframe should look like this:
   id      Type       Location   agent_id  created_at
0  44525   House      New Delhi  184       2018-03-09
1  44859   House      Amritsar   182       2017-02-19
2  45465   House      Faridabad  154       2017-04-17
3  50685   Plot       New Delhi  113       2017-09-01
4  130728  Apartment  Mumbai     157       2017-02-07
5  130856  Plot       Mumbai     137       2018-01-16
6  130857  Apartment  Bangalore  199       2017-03-24
So far I've tried this:
import pandas as pd
df = pd.read_csv('test_data.csv')
# I can extract these keywords one by one using for loops, but how
# can I do this in pandas with as few lines of code as possible?
for index, values in df.Type.iteritems():
    for i in Apartment:
        if i in values:
            print(i)

df_new = pd.DataFrame(df['id'])
Can someone tell me how to solve this?

First create the Location column with str.extract, joining the locations with | (regex OR):
pat = '|'.join(r"\b{}\b".format(x) for x in Location)
df['Location'] = df['Type'].str.extract('('+ pat + ')', expand=False)
Then build a dictionary from the other lists, swap keys with values, and in a loop set Type by a boolean mask from str.contains with the parameter case=False:
d = {'Apartment': Apartment,
     'House': House,
     'Plot': Plot}

d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

for k, v in d1.items():
    df.loc[df['Type'].str.contains(k, case=False), 'Type'] = v

print(df)
   id      Type       agent_id  created_at  Location
0  44525   House      184       2018-03-09  New Delhi
1  44859   House      182       2017-02-19  Amritsar
2  45465   House      154       2017-04-17  NaN
3  50685   Plot       113       2017-09-01  New Delhi
4  130728  Apartment  157       2017-02-07  Mumbai
5  130856  Plot       137       2018-01-16  Mumbai
6  130857  Apartment  199       2017-03-24  Bangalore
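As a compact variation (my sketch, not part of the original answer), the Type keywords can also be pulled out in one pass with a case-insensitive str.extract and then mapped through d1:

# extract whichever keyword matches, then map it to its category
pat_type = '|'.join(d1)      # 'apartment|penthouse|duplex|house|...'
df['Type'] = (df['Type'].str.extract('(?i)(' + pat_type + ')', expand=False)
                        .str.lower().map(d1))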

    106         if isna(key).any():
--> 107             raise ValueError('cannot index with vector containing '
    108                              'NA / NaN values')
    109         return False

ValueError: cannot index with vector containing NA / NaN values
I get the above error when I run this.
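That error usually means the Type column contains NaN: str.contains returns NaN for missing values, and boolean indexing rejects a mask containing NaN. Passing na=False treats missing values as non-matches; a hedged fix of the loop above:

# na=False turns the NaN results of str.contains into False, so the
# mask stays purely boolean even when Type has missing values
for k, v in d1.items():
    df.loc[df['Type'].str.contains(k, case=False, na=False), 'Type'] = v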

Related

reshape dataframe time series

I have a dataframe of weather data in a certain shape and I want to transform it, but I am struggling with it.
My dataframe looks like this:
city    temp_day1  temp_day2  temp_day3  ...  hum_day1  hum_day2  hum_day4  ...  condition
city_1  12         13         20              44        44.5      44             good
city_1  12         13         20              44        44.5      44             bad
city_2  14         04         33              44        44.5      44             good
I want to transform it to:
      city_1                            city_2
day   temperature  humidity  condition  temperature  humidity  condition  ...
1     12           44        good       ...
2     13           44.5      bad
3     20           NaN       bad
4     NaN          44
Some days don't have temperature or humidity values.
Thanks for your help.
Use wide_to_long with DataFrame.unstack, then DataFrame.swaplevel and DataFrame.sort_index:
df1 = (pd.wide_to_long(df,
                       stubnames=['temp', 'hum'],
                       i='city',
                       j='day',
                       sep='_',
                       suffix=r'\w+')
         .unstack(0)
         .swaplevel(1, 0, axis=1)
         .sort_index(axis=1))
print(df1)
city city_1
hum temp
day
day1 44.0 12.0
day2 44.5 13.0
day3 NaN 20.0
day4 44.0 NaN
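For reference, a minimal sketch (with a tiny hypothetical frame, my addition) of how the stubnames/sep/suffix arguments carve up the column names:

import pandas as pd

tiny = pd.DataFrame({'city': ['a'], 'temp_day1': [1], 'hum_day1': [2]})
# columns matching '<stub><sep><suffix>' are melted:
# 'temp_day1' -> stub 'temp', new 'day' index value 'day1'
print(pd.wide_to_long(tiny, stubnames=['temp', 'hum'],
                      i='city', j='day', sep='_', suffix=r'\w+'))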
Alternative solution:
df1 = df.set_index('city')
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack([0,1]).unstack([0,1])
If you need to extract the numbers from the index:
df1 = (pd.wide_to_long(df,
                       stubnames=['temp', 'hum'],
                       i='city',
                       j='day',
                       sep='_',
                       suffix=r'\w+')
         .unstack(0)
         .swaplevel(1, 0, axis=1)
         .sort_index(axis=1))
df1.index = df1.index.str.extract(r'(\d+)', expand=False)
print (df1)
city city_1
hum temp
day
1 44.0 12.0
2 44.5 13.0
3 NaN 20.0
4 44.0 NaN
EDIT:
Solution with real data:
df1 = df.set_index(['condition', 'ACTIVE', 'mode', 'apply', 'spy', 'month'], append=True)
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack([0,1]).unstack([0,-2])
If you need to remove unnecessary levels from the MultiIndex:
df1 = df1.reset_index(level=['condition', 'ACTIVE', 'mode', 'apply', 'spy', 'month'], drop=True)
You can use the pandas transpose method like this: df.T
This swaps the rows and columns of your dataframe, so each city becomes a column. If you create multiple columns, you can slice it with indexing and assign each slice to independent columns.
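A minimal sketch of what transposing does here, assuming the small weather frame from above:

import pandas as pd

df = pd.DataFrame({'city': ['city_1', 'city_2'],
                   'temp_day1': [12, 14],
                   'temp_day2': [13, 4]})
# .T swaps the axes: column names become the row index,
# and each city becomes its own column
print(df.set_index('city').T)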

Groupby and calculate count and means based on multiple conditions in Pandas

For the given dataframe as follows:
id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
#copy above data and run below code to reproduce dataframe
df = pd.read_clipboard(sep='|')
I would like to group by id and address and calculate mean_ratio and result_count based on the following conditions:
mean_ratio: group by id and address and take the mean of ratio over the rows where status is finished and start_date falls within 2019-09 or 2019-10
result_count: group by id and address and count the rows where status is either finished or failed and start_date falls within 2019-09 or 2019-10
The desired output will like this:
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0
1 2 888 Foster Street 1.32 1
2 3 5 Pawnee Avenue 1.25 1
3 4 916 W. Mill Pond St. 1.44 3
4 5 68 Henry Drive NaN 2
I have tried so far:
# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
To filter start_date to the months 2019-09 and 2019-10:
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]
To filter rows whose status is finished or failed, I use:
mask = df['status'].str.contains('finished|failed')
df[mask]
But I don't know how to combine these to get the final result. Thanks for your help in advance.
I think you need GroupBy.agg, but because some groups (like id=1) are filtered out entirely, add them back with DataFrame.join against all unique id/address pairs kept in df2, and finally fill the missing values in the result_count column:
df2 = df[['id','address']].drop_duplicates()
print (df2)
id address
0 1 7552 Atlantic Lane
2 2 888 Foster Street
5 3 5 Pawnee Avenue
9 4 916 W. Mill Pond St.
12 5 68 Henry Drive
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
df = df[mask1 & mask]
df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio', 'mean'),
                                        result_count=('ratio', 'size'))
df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0.0
2 2 888 Foster Street 1.320000 1.0
5 3 5 Pawnee Avenue 1.250000 1.0
9 4 916 W. Mill Pond St. 1.436667 3.0
12 5 68 Henry Drive NaN 2.0
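One small follow-up (my addition, not in the original answer): fillna leaves result_count as float; if integer counts are wanted, cast afterwards:

# restore integer counts after the NaN fill promoted the column to float
df1['result_count'] = df1['result_count'].astype(int)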
Some helpers
def mean_ratio(idf):
    # keep rows in the date window that have a non-null ratio
    idf = idf[
        (idf['start_date'].between('2019-09-01', '2019-10-31')) &
        (idf['mean_ratio'].notnull())]
    return np.round(idf['mean_ratio'].mean(), 2)

def result_count(idf):
    # count rows with a finished/failed status inside the date window
    idf = idf[
        (idf['status'].isin(['finished', 'failed'])) &
        (idf['start_date'].between('2019-09-01', '2019-10-31'))]
    return idf.shape[0]

# We can calculate `mean_ratio` beforehand
df['mean_ratio'] = df['sell_price'] / df['market_price']
df = df.astype({'start_date': 'datetime64[ns]', 'end_date': 'datetime64[ns]'})

# Group the df
g = df.groupby(['id', 'address'])
mean_ratio = g.apply(lambda idf: mean_ratio(idf)).to_frame('mean_ratio')
result_count = g.apply(lambda idf: result_count(idf)).to_frame('result_count')

# Final result
pd.concat((mean_ratio, result_count), axis=1)
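The concatenated result is indexed by (id, address); a reset_index (my addition) turns them back into regular columns to match the desired layout:

out = pd.concat((mean_ratio, result_count), axis=1).reset_index()
print(out)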

difference between two column of a dataframe

I am new to Python and would like to find the difference between two columns of a dataframe.
What I want is the difference between two columns along with a related third column. For example, I have a dataframe Soccer that lists all the soccer teams with their goals for and against. I want the goal difference along with the team name, i.e. GoalsDiff = GoalsFor - GoalsAgainst.
Pos Team Seasons Points GamesPlayed GamesWon GamesDrawn \
0 1 Real Madrid 86 5656 2600 1647 552
1 2 Barcelona 86 5435 2500 1581 573
2 3 Atletico Madrid 80 5111 2614 1241 598
GamesLost GoalsFor GoalsAgainst
0 563 5947 3140
1 608 5900 3114
2 775 4534 3309
I tried creating a function and then iterating through each row of a dataframe as below:
def Goal_diff_count_Formal(gFor, gAgainst, team):
    goalsDifference = gFor - gAgainst
    return [team, goalsDifference]

total = []
for index, row in football.iterrows():
    # pdb.set_trace()
    goalsFor = row['GoalsFor']
    goalsAgainst = row['GoalsAgainst']
    teamName = row['Team']
    total.append(Goal_diff_count_Formal(int(goalsFor), int(goalsAgainst), teamName))
However, I would like to know if there is a quicker way to get this, something like:
df['GoalsFor'] - df['GoalsAgainst']  # along with the team name in the dataframe
Solution if values in the Team column are unique - set Team as the index, take the difference, and select by team name:
df = df.set_index('Team')
s = df['GoalsFor'] - df['GoalsAgainst']
print (s)
Team
Real Madrid 2807
Barcelona 2786
Atletico Madrid 1225
dtype: int64
print (s['Atletico Madrid'])
1225
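If Team should stay a regular column instead of becoming the index, a plain vectorized assignment on the original frame (my variation on the same idea) also works:

# element-wise subtraction keeps the Team column alongside the result
df['GoalsDiff'] = df['GoalsFor'] - df['GoalsAgainst']
print(df[['Team', 'GoalsDiff']])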
Solution if values in the Team column may be duplicated:
I believe you need to group by Team and aggregate with sum first, and then take the difference:
#change sample data for Team in row 3
print (df)
Pos Team Seasons Points GamesPlayed GamesWon GamesDrawn \
0 1 Real Madrid 86 5656 2600 1647 552
1 2 Barcelona 86 5435 2500 1581 573
2 3 Real Madrid 80 5111 2614 1241 598
GamesLost GoalsFor GoalsAgainst
0 563 5947 3140
1 608 5900 3114
2 775 4534 3309
df = df.groupby('Team')[['GoalsFor', 'GoalsAgainst']].sum()
df['diff'] = df['GoalsFor'] - df['GoalsAgainst']
print (df)
GoalsFor GoalsAgainst diff
Team
Barcelona 5900 3114 2786
Real Madrid 10481 6449 4032
EDIT:
s = df['GoalsFor'] - df['GoalsAgainst']
print (s)
Team
Barcelona 2786
Real Madrid 4032
dtype: int64
print (s['Barcelona'])
2786

Pandas: Better way to combine rows for 'wide' dataset?

I'm trying to make a 'wide' dataset, with one record per game, rather than one record per team, per game. Here's a small example of what I have, first, and then what I'd like to have.
GAME-ID TEAM SCORE
0 123 Cleveland 95
1 123 Orlando 101
2 124 New York 104
3 124 Detroit 98
GAME-ID TEAM1 TEAM2 SCORE1 SCORE2
0 123 Cleveland Orlando 95 101
1 124 New York Detroit 104 98
I can set a flag for game id count (see below), then later use a for loop to iterate through and set values conditionally, but thought there may be an easier way.
import pandas as pd
dict1 = {'GAME-ID': [123, 123, 124, 124],
         'TEAM': ['Cleveland', 'Orlando', 'New York', 'Detroit'],
         'SCORE': [95, 101, 104, 98]}
df = pd.DataFrame(dict1)
df['GAME_ID_CT'] = df.groupby('GAME-ID').cumcount() + 1
print(df)
Result from code above:
GAME-ID TEAM SCORE GAME_ID_CT
0 123 Cleveland 95 1
1 123 Orlando 101 2
2 124 New York 104 1
3 124 Detroit 98 2
If there's a way to do this by column rather than a bunch of loops, it would be great.
You can try pivot:
new_df = df.pivot(index='GAME-ID', columns='GAME_ID_CT')
# flatten the MultiIndex columns, e.g. ('TEAM', 1) -> 'TEAM1'
new_df.columns = [f'{a}{b}' for a, b in new_df.columns]
Output:
TEAM1 TEAM2 SCORE1 SCORE2
GAME-ID
123 Cleveland Orlando 95 101
124 New York Detroit 104 98
I think this actually worked best for me. It's simple and accommodates lots more variables.
df1 = df[df['GAME_ID_CT'] == 1]
df2 = df[df['GAME_ID_CT'] == 2]
new_df = pd.merge(df1, df2, on='GAME-ID', suffixes=['1', '2'])
print(new_df)
GAME-ID TEAM1 SCORE1 GAME_ID_CT1 TEAM2 SCORE2 GAME_ID_CT2
0 123 Cleveland 95 1 Orlando 101 2
1 124 New York 104 1 Detroit 98 2
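To match the target layout exactly, the leftover GAME_ID_CT helper columns can be dropped afterwards (my addition):

# the cumcount flags only served to pair the rows; drop them once merged
new_df = new_df.drop(columns=['GAME_ID_CT1', 'GAME_ID_CT2'])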

Perform computation on a value in one row and update another row's column with that value

I have a dataframe that looks somewhat like:
Categor_1 Categor_2 Numeric_1 Numeric_2 Numeric_3 Numeric_col4 Month
ABC XYZ 3523 454 4354 565 2018-02
ABC XYZ 333 444 123 565 2018-03
qww ggg 3222 568 123 483976 2018-03
I would like to apply some simple math to a column under a condition and assign the result to a different row. For instance:
if Month == 2018-03 and Categor_2 == 'XYZ', compute Numeric_3 * 2 and assign the result to Numeric_3 under month 2018-02.
So the output would be something like:
Categor_1 Categor_2 Numeric_1 Numeric_2 Numeric_3_Adj Numeric_col4 Month
ABC XYZ 3523 454 246 565 2018-02
ABC XYZ 333 444 123 565 2018-03
qww ggg 3222 568 123 483976 2018-03
I was thinking of taking out the necessary columns, doing a pivot, applying the math, then reshaping it back to the original layout.
However, if there is a quicker way, I would be grateful to know.
It depends on the length of the Series produced by the filter - here it is a one-element Series, so you can reduce it to a scalar with next and iter, which also lets you supply a default value in case the condition matches nothing:
mask = (df.Month == '2018-03') & (df.Categor_2 == 'XYZ')
print(df.loc[mask, 'Numeric_3'] * 2)
1    246
Name: Numeric_3, dtype: int64
# get the first value of the Series; if the Series is empty, fall back to the default 0
a = next(iter(df.loc[mask, 'Numeric_3'] * 2), 0)
print(a)
246
df.loc[df.Month == '2018-02', 'Numeric_3'] = a
print (df)
Categor_1 Categor_2 Numeric_1 Numeric_2 Numeric_3 Numeric_col4 Month
0 ABC XYZ 3523 454 246 565 2018-02
1 ABC XYZ 333 444 123 565 2018-03
2 qww ggg 3222 568 123 483976 2018-03
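The next(iter(...), default) guard is what protects against an empty selection; a quick illustration (my addition, using a month that matches nothing):

# no row has Month == '2018-01', so the filtered Series is empty
empty = df.loc[(df.Month == '2018-01') & (df.Categor_2 == 'XYZ'), 'Numeric_3']
print(next(iter(empty * 2), 0))   # prints 0, where .iloc[0] would raise IndexError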
