Reshaping pandas dataframe with count option - python-3.x

Currently, I have this dataframe in pandas:
year product
8 2016 apples
15 2016 kiwis
17 2016 mango
24 2016 mango
32 2016 mango
34 2016 peach
41 2017 peach
45 2017 peach
48 2017 peach
53 2017 bananas
54 2017 mango
72 2017 peach
73 2017 peach
I've been trying with melt and pivot, but alas no luck. Basically I want to count the instances of products I sold each year. What I want as dataframe is this:
apples peach bananas kiwi mango
2016 1 1 0 1 3
2017 0 5 1 0 1
How can I reshape my df to the desired outcome?

Try:
df1 = df.groupby(['year', 'product']).size().unstack(1, fill_value=0)
print(df1)
product apples bananas kiwis mango peach
year
2016 1 0 1 3 1
2017 0 1 0 1 5
Or, as mentioned in the comments:
pd.crosstab(df['year'], df['product'])
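For reference, a minimal self-contained sketch of both approaches, rebuilding the sample data from the question (the index values of the original frame are omitted):
import pandas as pd

df = pd.DataFrame({
    'year': [2016] * 6 + [2017] * 7,
    'product': ['apples', 'kiwis', 'mango', 'mango', 'mango', 'peach',
                'peach', 'peach', 'peach', 'bananas', 'mango', 'peach', 'peach'],
})

# Count how many times each product was sold per year; missing combinations become 0
print(df.groupby(['year', 'product']).size().unstack(fill_value=0))

# The same frequency table via a cross-tabulation
print(pd.crosstab(df['year'], df['product']))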

Related

add column from a dataframe to another dataframe with same rows

I have a dataframe (df1) that contains 30,000 rows:
id Name Age
1 Joey 22
2 Anna 34
3 Jon 33
4 Amy 30
5 Kay 22
And another dataframe (df2) that contains the same columns, but with some ids missing:
id Name Age Sport
Jon 33 Tennis
5 Kay 22 Football
Joey 22 Basketball
4 Amy 30 Running
Anna 42 Dancing
I want the missing ids to appear in df2, matched to the corresponding name.
Desired df2:
id Name Age Sport
3 Jon 33 Tennis
5 Kay 22 Football
1 Joey 22 Basketball
4 Amy 30 Running
2 Anna 42 Dancing
Can someone help? I am new to pandas and dataframes.
You can use .map with .fillna:
import numpy as np

df2['id'] = df2['id'].replace('', np.nan, regex=True)\
    .fillna(df2['Name'].map(df1.set_index('Name')['id'])).astype(int)
print(df2)
id Name Age Sport
0 3 Jon 33 Tennis
1 5 Kay 22 Football
2 1 Joey 22 Basketball
3 4 Amy 30 Running
4 2 Anna 42 Dancing
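For context, here is a minimal, self-contained sketch of that map/fillna approach, rebuilding the two sample frames. It assumes the missing ids in df2 are empty strings, which is what the replace('', np.nan) step implies:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'Name': ['Joey', 'Anna', 'Jon', 'Amy', 'Kay'],
                    'Age': [22, 34, 33, 30, 22]})
# Assumption: the missing ids in df2 are empty strings
df2 = pd.DataFrame({'id': ['', 5, '', 4, ''],
                    'Name': ['Jon', 'Kay', 'Joey', 'Amy', 'Anna'],
                    'Age': [33, 22, 22, 30, 42],
                    'Sport': ['Tennis', 'Football', 'Basketball', 'Running', 'Dancing']})

# Blank ids become NaN, then each NaN is filled by looking up the Name in df1
lookup = df1.set_index('Name')['id']
df2['id'] = df2['id'].replace('', np.nan).fillna(df2['Name'].map(lookup)).astype(int)
print(df2)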
First, join the two dataframes with pd.merge on a shared key. Here 'Name' is the safe key (Anna's Age differs between the two frames, so 'Age' cannot be part of it). Then replace the null id values in df2, using np.where and .isnull() to find them.
df3 = pd.merge(df2, df1[['Name', 'id']], on='Name', how='left')
df2['id'] = np.where(df3['id_x'].isnull(), df3['id_y'], df3['id_x']).astype(int)
id Name Age Sport
0 3 Jon 33 Tennis
1 5 Kay 22 Football
2 1 Joey 22 Basketball
3 4 Amy 30 Running
4 2 Anna 42 Dancing

pandas groupby return data on original MultiIndex

Please see my example below: how can I return the data from the groupby on all 3 levels of the original MultiIndex?
In this example: I want to see the totals by brand. I have now applied a workaround using map (see below, this shows the output that I hope to get straight from the groupby).
import numpy as np
import pandas as pd

brands = ['Tesla','Tesla','Tesla','Peugeot', 'Peugeot', 'Citroen', 'Opel', 'Opel', 'Peugeot', 'Citroen', 'Opel']
years = [2018, 2017,2016, 2018, 2017, 2017, 2018, 2017,2016, 2016,2016]
owners = ['Tesla','Tesla','Tesla','PSA', 'PSA', 'PSA', 'PSA', 'PSA','PSA', 'PSA', 'PSA']
index = pd.MultiIndex.from_arrays([owners, years, brands], names=['owner', 'year', 'brand'])
data = np.random.randint(low=100, high=1000, size=len(index), dtype=int)
weight = np.random.randint(low=1, high=10, size=len(index), dtype=int)
df = pd.DataFrame({'data': data, 'weight': weight},index=index)
df.loc[('PSA', 2017, 'Opel'), 'data'] = np.nan
df.loc[('PSA', 2016, 'Opel'), 'data'] = np.nan
df.loc[('PSA', 2016, 'Citroen'), 'data'] = np.nan
df.loc[('Tesla', 2016, 'Tesla'), 'data'] = np.nan
out:
data weight
owner year brand
PSA 2016 Citroen NaN 5
Opel NaN 5
Peugeot 250.0 2
2017 Citroen 469.0 4
Opel NaN 5
Peugeot 768.0 5
2018 Opel 237.0 6
Peugeot 663.0 4
Tesla 2016 Tesla NaN 3
2017 Tesla 695.0 6
2018 Tesla 371.0 5
I have tried with the index and level=..., as well as with columns and by=. I have also tried as_index=False with .sum(), and group_keys=False with .apply(sum). But I am not able to get the brand column back in the groupby output:
grouped = df.groupby(level=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.apply(sum)
out:
data weight group_data
owner year
PSA 2016 250.0 12.0 750.0
2017 1237.0 14.0 3711.0
2018 900.0 10.0 1800.0
Tesla 2016 0.0 3.0 0.0
2017 695.0 6.0 695.0
2018 371.0 5.0 371.0
Similar:
grouped = df.groupby(by=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.apply(sum)
or:
grouped = df.groupby(by=['owner', 'year'], as_index=False, group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
grouped.sum()
Workaround:
grouped = df.groupby(level=['owner', 'year'], group_keys=False) #type: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
df_owner_year = grouped.apply(sum)
s_data = df_owner_year['data']
df['group_data'] = df.index.map(s_data)
df
out:
data weight group_data
owner year brand
PSA 2016 Citroen NaN 5 250.0
Opel NaN 5 250.0
Peugeot 250.0 2 250.0
2017 Citroen 469.0 4 1237.0
Opel NaN 5 1237.0
Peugeot 768.0 5 1237.0
2018 Opel 237.0 6 900.0
Peugeot 663.0 4 900.0
Tesla 2016 Tesla NaN 3 0.0
2017 Tesla 695.0 6 695.0
2018 Tesla 371.0 5 371.0
You can use groupby to accomplish this.
df = df.sort_index()
print(df)
data weight
owner year brand
PSA 2016 Citroen NaN 4
Opel NaN 7
Peugeot 880.0 1
2017 Citroen 164.0 2
Opel NaN 5
Peugeot 607.0 8
2018 Opel 809.0 1
Peugeot 317.0 8
Tesla 2016 Tesla NaN 1
2017 Tesla 384.0 9
2018 Tesla 550.0 9
Group by owner and year, then set your new column equal to that result.
df['new'] = df.groupby(['owner', 'year'])['data'].sum()
print(df)
data weight new
owner year brand
PSA 2016 Citroen NaN 4 880.0
Opel NaN 7 880.0
Peugeot 880.0 1 880.0
2017 Citroen 164.0 2 771.0
Opel NaN 5 771.0
Peugeot 607.0 8 771.0
2018 Opel 809.0 1 1126.0
Peugeot 317.0 8 1126.0
Tesla 2016 Tesla NaN 1 0.0
2017 Tesla 384.0 9 384.0
2018 Tesla 550.0 9 550.0
EDIT
A further question was asked: why does df['new'] return NaN when grouping by the columns, while the proper values are returned when the grouping is on the index? I posed this question on SO and an excellent answer by @Jezrael is here.
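As a footnote, grouping on the index levels with transform sidesteps that alignment issue entirely and broadcasts the group total back onto all three levels in one step (a sketch using the frame from the question):
# Per-(owner, year) total of 'data', repeated for every brand row
df['group_data'] = df.groupby(level=['owner', 'year'])['data'].transform('sum')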
I am sure there are cases in which MultiIndex is valuable, but I usually just want to get rid of it as soon as possible, so I'd start off with df = df.reset_index().
Then you can easily group by brand, like for example:
>>> df.groupby('brand').agg({'weight': sum, 'data': sum})
# weight data
# brand
# Citroen 10 784.0
# Opel 13 193.0
# Peugeot 14 1663.0
# Tesla 18 507.0
Or group by owner and year:
>>> df.groupby(['owner', 'year']).agg({'weight': sum, 'data': sum})
# weight data
# owner year
# PSA 2016 17 879.0
# 2017 8 1264.0
# 2018 12 497.0
# Tesla 2016 8 0.0
# 2017 4 151.0
# 2018 6 356.0
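And if you still want the per-group total attached to every row after the reset_index (as in the question's workaround), transform works on plain columns as well; a brief sketch:
# After df = df.reset_index(), owner and year are ordinary columns
df['group_data'] = df.groupby(['owner', 'year'])['data'].transform('sum')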

How can I reset pct change to NaN when another column value is different than previous row?

I have a dataframe (df) like so:
Year | Name | Count
2017 John 1
2018 John 2
2019 John 3
2017 Fred 1
2018 Fred 2
2019 Fred 3
df['pct_chg'] = df['Count'].pct_change() yields:
Year | Name | Count | pct_chg
2017 John 1 NaN
2018 John 2 1
2019 John 3 .5
2017 Fred 1 -.66
2018 Fred 2 1
2019 Fred 3 .5
Keeping the columns the same, is there a way to get pct_change() to restart when Name is a new value? There does not seem to be a parameter for this in the documentation.
desired output:
Year | Name | Count | pct_chg
2017 John 1 NaN
2018 John 2 1
2019 John 3 .5
2017 Fred 1 NaN
2018 Fred 2 1
2019 Fred 3 .5
The change is superficial, but it helps with eyeballing new names.
You can use:
df['pct_chg'] = df.groupby([df.Name.ne(df.Name.shift()).cumsum(), 'Name'])['Count']\
    .apply(lambda x: x.pct_change())
print(df)
Year Name Count pct_chg
0 2017 John 1 NaN
1 2018 John 2 1.0
2 2019 John 3 0.5
3 2017 Fred 1 NaN
4 2018 Fred 2 1.0
5 2019 Fred 3 0.5
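Since each name appears in one contiguous block here, a plain groupby on Name gives the same result with less machinery (a simpler sketch, not the answer's exact code):
df['pct_chg'] = df.groupby('Name')['Count'].pct_change()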

Display minimum value excluding zero, along with adjacent column value, for each year (Python 3, dataframe)

I have a dataframe with three columns: Year, Product, Price. I want to calculate the minimum Price, excluding zero, for each year, and also to show the adjacent Product value for that minimum.
Data:
Year Product Price
2000 Grapes 0
2000 Apple 220
2000 pear 185
2000 Watermelon 172
2001 Orange 0
2001 Muskmelon 90
2001 Pear 165
2001 Watermelon 99
Desired output in a new dataframe:
Year Minimum Price Product
2000 172 Watermelon
2001 90 Muskmelon
First filter out 0 rows by boolean indexing:
df1 = df[df['Price'] != 0]
Then use DataFrameGroupBy.idxmin to get the index of the minimal Price per group, and select those rows with loc:
df2 = df1.loc[df1.groupby('Year')['Price'].idxmin()]
An alternative is sort_values with drop_duplicates:
df2 = df1.sort_values(['Year', 'Price']).drop_duplicates('Year')
print (df2)
Year Product Price
3 2000 Watermelon 172
5 2001 Muskmelon 90
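To match the column names and order asked for in the question, a small follow-up sketch renames and reorders:
df2 = df2.rename(columns={'Price': 'Minimum Price'})[['Year', 'Minimum Price', 'Product']]
print(df2)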
If multiple minimal values are possible and you need all of them per group:
print (df)
Year Product Price
0 2000 Grapes 0
1 2000 Apple 220
2 2000 pear 172
3 2000 Watermelon 172
4 2001 Orange 0
5 2001 Muskmelon 90
6 2001 Pear 165
7 2001 Watermelon 99
df1 = df[df['Price'] != 0]
df = df1[df1['Price'].eq(df1.groupby('Year')['Price'].transform('min'))]
print (df)
Year Product Price
2 2000 pear 172
3 2000 Watermelon 172
5 2001 Muskmelon 90
EDIT:
print (df)
Year Product Price
0 2000 Grapes 0
1 2000 Apple 220
2 2000 pear 185
3 2000 Watermelon 172
4 2001 Orange 0
5 2001 Muskmelon 90
6 2002 Pear 0
7 2002 Watermelon 0
df['Price'] = df['Price'].replace(0, np.nan)
df2 = df.sort_values(['Year', 'Price']).drop_duplicates('Year')
df2['Product'] = df2['Product'].mask(df2['Price'].isnull(), 'No data')
print (df2)
Year Product Price
3 2000 Watermelon 172.0
5 2001 Muskmelon 90.0
6 2002 No data NaN

Index Match with multiple criteria

I have a list of products ranked by percentile. I want to be able to retrieve the first value less than a specific percentile.
Product Orders Percentile Current Value Should Be
Apples 192 100.00% 29 29
Apples 185 97.62% 29 29
Apples 125 95.24% 29 29
Apples 122 92.86% 29 29
Apples 120 90.48% 29 29
Apples 90 88.10% 29 29
Apples 30 85.71% 29 29
Apples 29 83.33% 29 29
Apples 27 80.95% 29 29
Apples 25 78.57% 29 29
Apples 25 78.57% 29 29
Apples 25 78.57% 29 29
Oranges 2 100.00% 0 1
Oranges 2 100.00% 0 1
Oranges 1 60.00% 0 1
Oranges 1 60.00% 0 1
Lemons 11 100.00% 0 2
Lemons 10 88.89% 0 2
Lemons 2 77.78% 0 2
Lemons 2 77.78% 0 2
Lemons 1 55.56% 0 2
Currently my formula in the "Current Value" column is: =SUMIFS([Orders],[Product],[#[Product]],[Percentile],INDEX([Percentile],MATCH(FALSE,[Percentile]>$O$1,0))) (entered as an array formula)
$O$1 contains the percentile that I am matching (85.00%).
The current value for "Apples" (29) is correct, but as you can see, my formula is not producing the correct values for the remaining products (see "Should Be"); it returns 0 instead. I am not sure how to set this up to do what I need. I tried several things with SUMPRODUCT but couldn't get that to work either. I need someone with more experience to give me a hand on this.
You don't need the SUMIFS(), just the INDEX/MATCH:
=INDEX([Orders],MATCH(1,([Percentile]<$O$1)*([Product]=[#Product]),0))
This is an array formula and must be confirmed with Ctrl-Shift-Enter on exiting edit mode. If done properly then Excel will put {} around the formula.
