Lookup value in one dataframe and paste it into another dataframe - python-3.x

I have two dataframes in Python: one big (car listings) and one small (car base configuration prices). The small one looks like this:
Make Model MSRP
0 Acura ILX 27990
1 Acura MDX 43015
2 Acura MDX Sport Hybrid 51960
3 Acura NSX 156000
4 Acura RDX 35670
5 Acura RLX 54450
6 Acura TLX 31695
7 Alfa Romeo 4C 55900
8 Alfa Romeo Giulia 37995
… … … . …
391 Toyota Yaris 14895
392 Toyota Yaris iA 15950
393 Volkswagen Atlas 33500
394 Volkswagen Beetle 19795
395 Volkswagen CC 34475
396 Volkswagen GTI 24995
397 Volkswagen Golf 19575
398 Volkswagen Golf Alltrack 25850
399 Volkswagen Golf R 37895
400 Volkswagen Golf SportWagen 21580
401 Volkswagen Jetta 17680
402 Volkswagen Passat 22440
403 Volkswagen Tiguan 24890
404 Volkswagen Touareg 42705
405 Volkswagen e-Golf 28995
406 Volvo S60 33950
Now I want to paste the values from the MSRP column (far right column) based on matching the Make and Model columns into the big dataframe (car listings) that looks like the following:
makeName modelName trimName carYear mileage
0 BMW X5 sDrive35i 2017 0
1 BMW X5 sDrive35i 2017 3
2 BMW X5 sDrive35i 2017 0
3 Audi A4 Premium Plus 2017 0
4 Kia Optima LX 2016 10
5 Kia Optima SX Turbo 2017 15
6 Kia Optima EX 2016 425
7 Rolls-Royce Ghost Series II 2017 15
… … … … … …
In the end I would like to have the following:
makeName modelName trimName carYear mileage MSRP
0 BMW X5 sDrive35i 2017 0 value from the other table
1 BMW X5 sDrive35i 2017 3 value from the other table
2 BMW X5 sDrive35i 2017 0 value from the other table
3 Audi A4 Premium Plus 2017 0 value from the other table
4 Kia Optima LX 2016 10 value from the other table
5 Kia Optima SX Turbo 2017 15 value from the other table
6 Kia Optima EX 2016 425 value from the other table
7 Rolls-Royce Ghost Series II 2017 15 value from the other table
… … … … … …
I read the documentation regarding pd.concat, merge and join but I am not making any progress.
Can you guys help?
Thanks!

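You can use merge to join the two dataframes together. Merge from the listings side with how='left' so that every listing is kept, and give each frame its own key columns:
car_listings.merge(car_base, how='left', left_on=['makeName','modelName'], right_on=['Make','Model'])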
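A minimal, self-contained sketch of such a merge. The frame names car_listings and car_base follow the answer; the listing rows are made up for illustration, and the prices are taken from the sample table in the question. It assumes you want to keep every listing, so a make/model with no price ends up as NaN:

```python
import pandas as pd

# Small price table (a few rows from the question's sample)
car_base = pd.DataFrame({
    "Make": ["Acura", "Acura", "Volkswagen"],
    "Model": ["ILX", "MDX", "Golf"],
    "MSRP": [27990, 43015, 19575],
})

# Illustrative listings (hypothetical rows, not the asker's data)
car_listings = pd.DataFrame({
    "makeName": ["Volkswagen", "Acura", "BMW", "Acura"],
    "modelName": ["Golf", "ILX", "X5", "ILX"],
    "carYear": [2017, 2017, 2017, 2016],
    "mileage": [0, 3, 0, 10],
})

# Left merge: one output row per listing, MSRP looked up by make and model
merged = car_listings.merge(
    car_base,
    how="left",
    left_on=["makeName", "modelName"],
    right_on=["Make", "Model"],
).drop(columns=["Make", "Model"])  # the duplicate key columns are no longer needed

print(merged)
```

The BMW X5 row has no match in this tiny price table, so its MSRP is NaN; that also turns the MSRP column into floats.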

Related

Pivoting a table with duplicate index

I wanted to pivot this table:
Year County Sex rate
0 2006 Alameda Male 45.80
1 2006 Alameda Female 54.20
2 2006 Alpine Male 52.81
3 2006 Alpine Female 47.19
4 2006 Amador Male 49.97
5 2006 Amador Female 50.30
My desired output is:
Year County Male Female
2006 Alameda 45.80 54.20
2006 Alpine 52.81 47.19
2006 Amador 49.97 50.30
I tried doing this:
sex_rate=g.pivot(index="County",columns='Year',values='rate')
But I keep getting this error:
ValueError: Index contains duplicate entries, cannot reshape
Please help. I am new to python
I think you want index=['Year', 'County'], not just index='County'. And since you are passing two columns to index, you may want to use pivot_table instead of pivot:
df.pivot_table(index=['Year','County'],
               columns='Sex', values='rate'
              ).reset_index()
Output:
Sex Year County Female Male
0 2006 Alameda 54.20 45.80
1 2006 Alpine 47.19 52.81
2 2006 Amador 50.30 49.97
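A runnable sketch of the same idea on the six sample rows, assuming the Sex values are spelled consistently. It first reproduces the error and then applies pivot_table. One caveat: pivot_table aggregates duplicates (by default with the mean) instead of raising, so it can hide genuinely repeated rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2006] * 6,
    "County": ["Alameda", "Alameda", "Alpine", "Alpine", "Amador", "Amador"],
    "Sex": ["Male", "Female"] * 3,
    "rate": [45.80, 54.20, 52.81, 47.19, 49.97, 50.30],
})

# Each County/Year pair occurs twice (once per sex), so this cannot be reshaped
pivot_error = None
try:
    df.pivot(index="County", columns="Year", values="rate")
except ValueError as err:
    pivot_error = str(err)
    print("pivot failed:", pivot_error)

# Spread Sex into columns instead; Year and County together identify a row
wide = df.pivot_table(index=["Year", "County"], columns="Sex", values="rate").reset_index()
wide.columns.name = None  # drop the leftover "Sex" label on the column index
print(wide)
```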

Calculate Percentage using Pandas DataFrame

Of all the medals won by these 5 countries across all Olympics, what percentage was won by each of them?
I have combined all the Excel files into one using a pandas DataFrame, but I am now stuck on finding the percentage.
Country Gold Silver Bronze Total
0 USA 10 13 11 34
1 China 2 2 4 8
2 UK 1 0 1 2
3 Germany 12 16 8 36
4 Australia 2 0 0 2
0 USA 9 9 7 25
1 China 2 4 5 11
2 UK 0 1 0 1
3 Germany 11 12 6 29
4 Australia 1 0 1 2
0 USA 9 15 13 37
1 China 5 2 4 11
2 UK 1 0 0 1
3 Germany 10 13 7 30
4 Australia 2 1 0 3
Combined data sheet
Code that I have tried so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

files = ['E:\\olympics\\Olympics-2002.xlsx', 'E:\\olympics\\Olympics-2006.xlsx',
         'E:\\olympics\\Olympics-2010.xlsx', 'E:\\olympics\\Olympics-2014.xlsx',
         'E:\\olympics\\Olympics-2018.xlsx']
# DataFrame.append was removed in pandas 2.0, so concatenate the sheets instead
df = pd.concat([pd.read_excel(f, 'Sheet1') for f in files])
df.to_excel("E:\\olympics\\combineddata.xlsx")
data = pd.read_excel("E:\\olympics\\combineddata.xlsx")
print(data)
final_Data = {}
for i in data['Country']:
    x = i
    t1 = (data[(data.Country == x)].Total).tolist()
    print("Name of Country=", i, int(sum(t1)))
    final_Data.update({i: int(sum(t1))})
t3 = data.groupby('Country').Total.sum()  # medals per country
t2 = df['Total'].sum()                    # medals overall
t4 = t3 / t2 * 100                        # each country's share in percent
print(t3)
print(t2)
print(t4)
This is how I got the answer. Now I need to plot it; I want to show it as a pie chart.
Let's assume you have created the DataFrame as 'df'. Then you can do the following to first group by and then calculate percentages.
df = df.groupby('Country').sum()
df['Gold_percent'] = (df['Gold'] / df['Gold'].sum()) * 100
df['Silver_percent'] = (df['Silver'] / df['Silver'].sum()) * 100
df['Bronze_percent'] = (df['Bronze'] / df['Bronze'].sum()) * 100
df['Total_percent'] = (df['Total'] / df['Total'].sum()) * 100
df = df.round(2)
print (df)
The output will be as follows:
Gold Silver Bronze ... Silver_percent Bronze_percent Total_percent
Country ...
Australia 5 1 1 ... 1.14 1.49 3.02
China 9 8 13 ... 9.09 19.40 12.93
Germany 33 41 21 ... 46.59 31.34 40.95
UK 2 1 1 ... 1.14 1.49 1.72
USA 28 37 31 ... 42.05 46.27 41.38
I don't have your exact dataset, so I am explaining with a similar one. Sum the values across each row, then find the percentage by dividing each row's sum by the grand total of those columns.
I am posting this as a model; check it:
import pandas as pd
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
'ExshowroomPrice': [21000,26000,28000,34000],'RTOPrice': [2200,250,2700,3500]}
df = pd.DataFrame(cars, columns = ['Brand', 'ExshowroomPrice','RTOPrice'])
Brand ExshowroomPrice RTOPrice
0 Honda Civic 21000 2200
1 Toyota Corolla 26000 250
2 Ford Focus 28000 2700
3 Audi A4 34000 3500
df['percentage'] = ((df.ExshowroomPrice + df.RTOPrice) * 100
                    / (df.ExshowroomPrice.sum() + df.RTOPrice.sum()))
print(df)
Brand ExshowroomPrice RTOPrice percentage
0 Honda Civic 21000 2200 19.719507
1 Toyota Corolla 26000 250 22.311942
2 Ford Focus 28000 2700 26.094348
3 Audi A4 34000 3500 31.874203
Hope that's clear.

Python percentage of 2 columns in new column based on condition

I asked this question earlier and got some feedback, but I am still stuck: I am not able to calculate the percentage of 2 columns based on conditions. The 2 columns are ‘Tested population’ and ‘Total population’, grouped by ‘Year’ and ‘Gender’, and the result should be shown in a new column, ‘Percentage’.
Year Race Gender Tested population Total population
2017 Asian Male 345 567
2017 Hispanic Female 666 67899
2018 Native Male 333 35543
2018 Asian Female 665 78955
2019 Hispanic Female 4444 44356
2020 Native Male 3642 6799
2017 Asian Male 5467 7998
2018 Asian Female 5467 7998
2019 Hispanic Male 456 4567
Table
code
df = pd.DataFrame(alldata, columns=['Year', 'Gender', 'Tested population', 'Total population'])
df2 = df.groupby(['Year', 'Gender']).agg({'Tested population': 'sum'})
pop_pcts = df2.groupby(level=0).apply(lambda x:
100 * x / float(x.sum()))
print(pop_pcts)
Output:
Tested population
Year Gender
2017 Female 10.280951
Male 89.719049
2018 Female 94.849188
Male 5.150812
2019 Female 90.693878
Male 9.306122
2020 Male 100.000000
Whereas I want the data in this format, with the other columns kept and a new column 'Percentage' added:
Year Race Gender Tested population Total population Percentage
2017 Asian Male 345 567 60.8466
2017 Hispanic Female 666 67899 0.98087
2018 Native Male 333 35543 0.93689
2018 Asian Female 665 78955 0.84225
2019 Hispanic Female 4444 44356 10.0189
2020 Native Male 3642 6799 53.5667
2019 Hispanic Male 456 4567 9.98467
I have gone through "Pandas percentage of total with groupby" but was not able to fix my issue. Can someone help with this?
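I believe you just need to add a column. Note that the column name must match exactly ('Total population'), and multiply by 100 to get a percentage:
df['Percentage'] = df['Tested population'] / df['Total population'] * 100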
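A self-contained sketch of the row-wise calculation on the question's sample rows. Multiplying by 100 is what yields figures like 60.8466 in the desired output. The grouped variant underneath is only an assumption about what "based on grouping Year & Gender" might mean; the question's desired output does not confirm it.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2017, 2017, 2018, 2018, 2019, 2020, 2017, 2018, 2019],
    "Race": ["Asian", "Hispanic", "Native", "Asian", "Hispanic",
             "Native", "Asian", "Asian", "Hispanic"],
    "Gender": ["Male", "Female", "Male", "Female", "Female",
               "Male", "Male", "Female", "Male"],
    "Tested population": [345, 666, 333, 665, 4444, 3642, 5467, 5467, 456],
    "Total population": [567, 67899, 35543, 78955, 44356, 6799, 7998, 7998, 4567],
})

# Row-wise percentage: every original column is kept, one new column is added
df["Percentage"] = df["Tested population"] / df["Total population"] * 100
print(df)

# Assumed variant: one percentage per Year/Gender group, from the summed counts
grouped = df.groupby(["Year", "Gender"], as_index=False)[
    ["Tested population", "Total population"]
].sum()
grouped["Percentage"] = grouped["Tested population"] / grouped["Total population"] * 100
print(grouped)
```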

How to fill empty cell value in pandas with condition

My sample dataset is as below. The actual data is available up to 2020.
Item Year Amount final_sales
A1 2016 123 400
A2 2016 23 40
A3 2016 6
A4 2016 10 100
A5 2016 5 200
A1 2017 123 400
A2 2017 23
A3 2017 6
A4 2017 10
A5 2017 5 200
I have to carry the final_sales value forward from 2016 into 2017 (and subsequent years) for every Item whenever the later year's value is not available.
In the above dataset, final_sales is not available for 2017 for A2 and A4, but it is available for 2016. How can I bring in the 2016 final_sales value when the corresponding year's final_sales is not available?
Expected results as below. Thanks.
Item Year Amount final_sales
A1 2016 123 400
A2 2016 23 40
A3 2016 6
A4 2016 10 100
A5 2016 5 200
A1 2017 123 400
A2 2017 23 40
A3 2017 6
A4 2017 10 100
A5 2017 5 200
It looks like you want to fill forward where there is missing data.
You can do this with 'fillna', which is available on pd.DataFrame objects.
In your case, you only want to fill forward for each item, so first group by item, and then use fillna. The method 'pad' just carries forward in order (hence why we sort first).
df['final_sales'] = df.sort_values('Year').groupby('Item')['final_sales'].fillna(method='pad')
Note that on your example data, A3 is missing for 2016 as well, so there is nothing to carry forward and it remains missing for 2017.
GroupBy.ffill works for me; it only requires the Year column to be sorted, as in the question's sample data:
# sort by both columns if necessary
df = df.sort_values(['Year', 'Item'])
df['final_sales'] = df.groupby('Item')['final_sales'].ffill()
print (df)
Item Year Amount final_sales
0 A1 2016 123 400.0
1 A2 2016 23 40.0
2 A3 2016 6 NaN
3 A4 2016 10 100.0
4 A5 2016 5 200.0
5 A1 2017 123 400.0
6 A2 2017 23 40.0
7 A3 2017 6 NaN
8 A4 2017 10 100.0
9 A5 2017 5 200.0
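Something like this?
def fill_final(x):
    # Only fill later years whose final_sales is missing
    if x['Year'] != 2016 and pd.isna(x['final_sales']):
        match = df[(df['Year'] == 2016) & (df['Item'] == x['Item'])]['final_sales']
        # Take the single 2016 value if there is one, otherwise leave it missing
        return match.iloc[0] if not match.empty else x['final_sales']
    return x['final_sales']
df['final_sales'] = df.apply(fill_final, axis=1)
I did not test this, but it should set you on the right path.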

selecting rows in a data.frame in which a certain column has values containing one of a set of prefixes

I have a data.frame of the type:
> head(engschools)
RECTYPE LEA ESTAB URN SCHNAME TOWN PCODE
1 1 919 2028 138231 Alban City School n.a. E1 3RR
2 1 919 4003 138582 Samuel Ryder Academy St Albans AL1 5AR
3 1 919 2004 138201 Hatfield Community Free School Hatfield AL10 8ES
4 2 919 7012 117671 St Luke's School n.a BR3 7ET
5 1 919 2018 138561 Harpenden Free School Redbourn AL3 7QA
6 2 919 7023 117680 Lakeside School Welwyn Garden City AL8 6YN
And a set of prefixes like this one:
>head(prefixes)
E
AL
I would like to select the rows from the data.frame engschools that have values in column PCODE which contain one of the prefixes in prefixes. The correct result would thus contain rows 1:3 and 5:6 but not row 4.
You can try something like this:
engschools[grep(paste0("^", prefixes, collapse="|"), engschools$PCODE), ]
# RECTYPE LEA ESTAB URN SCHNAME TOWN PCODE
# 1 1 919 2028 138231 Alban City School n.a. E1 3RR
# 2 1 919 4003 138582 Samuel Ryder Academy St Albans AL1 5AR
# 3 1 919 2004 138201 Hatfield Community Free School Hatfield AL10 8ES
# 5 1 919 2018 138561 Harpenden Free School Redbourn AL3 7QA
# 6 2 919 7023 117680 Lakeside School Welwyn Garden City AL8 6YN
Here, we have used:
paste to create our search pattern (in this case, "^E|^AL").
grep to identify the row indexes that match the provided pattern.
Basic [ style extracting to extract the relevant rows.
