How to calculate percentage existence - pandas-groupby

import pandas as pd

pets = pd.DataFrame.from_dict(
    {'Izzy':   ['Smith',  'cat',  1],
     'Lynx':   ['Smith',  'cat',  9],
     'Oreo':   ['Smith',  'dog',  7],
     'Archie': ['Mack',   'dog',  3],
     'Prim':   ['Mack',   'cat',  1],
     'Fern':   ['Somers', 'cat', 12]},
    orient='index')
pets.columns = ['family', 'type', 'age']
        family    type  age
Izzy     Smith     cat    1
Lynx     Smith     cat    9
Oreo     Smith     dog    7
Archie    Mack     dog    3
Prim      Mack     cat    1
Fern    Somers     cat   12
I want to calculate the average age of each type of pet across all families, and also the percentage of families that own each type of pet. So I'm starting with this, which is easy.
pets.groupby('type')[['age']].mean()
       age
type
cat   5.75
dog   5.00
But I'm not sure how to get the second number: in this case, 100% for cats and 67% for dogs. I'm sure I can get there in several steps, but is there an easy way to do this in the same groupby?

Try:
x = pets.groupby("type")["family"].nunique() / pets["family"].nunique() * 100
print(x)
Prints:
type
cat    100.000000
dog     66.666667
Name: family, dtype: float64
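If you really want both numbers out of a single groupby, here is a minimal sketch using named aggregation; the n_families variable and the avg_age/pct_families column names are illustrative, not part of the answer above:

# Sketch: both statistics in one groupby pass; assumes the pets frame above.
n_families = pets['family'].nunique()
summary = pets.groupby('type').agg(
    avg_age=('age', 'mean'),                  # average age per pet type
    pct_families=('family', lambda s: s.nunique() / n_families * 100),
)
print(summary)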


Create new dataframe column where cell value indicates animal type

Suppose that I have the following dataframe, called pet_stores, that holds the number of cats and dogs at each location of a pet store franchise:
   Dog  Cat           City
0    5   11            NYC
1    4    1  San Francisco
How can I transform this dataframe such that instead of having separate Dog and Cat columns I have a single column called animal_type? I want the following result:
  animal_type  Count           City
0         Dog      5            NYC
1         Dog      4  San Francisco
2         Cat     11            NYC
3         Cat      1  San Francisco
Thanks!
Use melt:
>>> df.melt('City', var_name='animal_type', value_name='Count')
            City animal_type  Count
0            NYC         Dog      5
1  San Francisco         Dog      4
2            NYC         Cat     11
3  San Francisco         Cat      1
Without melt, using index manipulation instead:
>>> df.set_index('City').rename_axis(columns='animal_type') \
.stack().rename('Count').reset_index()
            City animal_type  Count
0            NYC         Dog      5
1            NYC         Cat     11
2  San Francisco         Dog      4
3  San Francisco         Cat      1
Use sort_values to change the order of rows.
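For a self-contained run of the stack variant, here is a sketch that rebuilds the frame from the question and uses sort_values to restore the animal-major row order the asker showed; the df construction is illustrative:

import pandas as pd

# Rebuild the example frame from the question (illustrative).
df = pd.DataFrame({'Dog': [5, 4], 'Cat': [11, 1],
                   'City': ['NYC', 'San Francisco']})
out = (df.set_index('City').rename_axis(columns='animal_type')
         .stack().rename('Count').reset_index()
         .sort_values('animal_type', ascending=False, ignore_index=True))
print(out)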

How to find the total length of a column value that has multiple values in different rows for another column

Is there a way to find the IDs that have both Apple and Strawberry, and then count them? And likewise count the IDs that have only Apple, and the IDs that have only Strawberry?
df:
    ID       Fruit
0  ABC       Apple        <- ABC has Apple and Strawberry
1  ABC  Strawberry        <- ABC has Apple and Strawberry
2  EFG       Apple        <- EFG has Apple only
3  XYZ       Apple        <- XYZ has Apple and Strawberry
4  XYZ  Strawberry        <- XYZ has Apple and Strawberry
5  CDF  Strawberry        <- CDF has Strawberry only
6  AAA       Apple        <- AAA has Apple only
Desired output:
Number of IDs that have Apple and Strawberry: 2
Number of IDs that have Apple only: 2
Number of IDs that have Strawberry only: 1
Thanks!
If the Fruit column only ever contains the values Apple and Strawberry, you can compare the set of fruits in each group against both values and then count the matching IDs by summing the True values:
v = ['Apple','Strawberry']
out = df.groupby('ID')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print (out)
2
EDIT: If there are many different values:
s = df.groupby('ID')['Fruit'].agg(frozenset).value_counts()
print (s)
{Apple}                2
{Strawberry, Apple}    2
{Strawberry}           1
Name: Fruit, dtype: int64
You can use pivot_table and value_counts for DataFrames (pandas 1.1.0+):
df.pivot_table(index='ID', columns='Fruit', aggfunc='size', fill_value=0)\
.value_counts()
Output:
Apple  Strawberry
1      1             2
       0             2
0      1             1
Alternatively you can use:
df.groupby(['ID', 'Fruit']).size().unstack('Fruit', fill_value=0)\
.value_counts()
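A self-contained sketch of the pivot_table + value_counts route, rebuilding the df from the question; the frame construction is illustrative:

import pandas as pd

# Rebuild the example frame from the question (illustrative).
df = pd.DataFrame({'ID': ['ABC', 'ABC', 'EFG', 'XYZ', 'XYZ', 'CDF', 'AAA'],
                   'Fruit': ['Apple', 'Strawberry', 'Apple', 'Apple',
                             'Strawberry', 'Strawberry', 'Apple']})
counts = (df.pivot_table(index='ID', columns='Fruit',
                         aggfunc='size', fill_value=0)
            .value_counts())
print(counts)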

Join on a second column if there is not a match on the first column of a pandas dataframe

I need to be able to match on a second column if there is not a match on the first column of a pandas dataframe (Python 3.x).
Ex.
table_df = pd.DataFrame({
    'Name': ['James', 'Tim', 'John', 'Emily'],
    'NickName': ['Jamie', '', '', 'Em'],
    'Colour': ['Blue', 'Black', 'Red', 'Purple']
})
lookup_df = pd.DataFrame({
    'Name': ['Tim', 'John', 'Em', 'Jamie'],
    'Pet': ['Cat', 'Dog', 'Fox', 'Dog']
})
table_df
    Name NickName  Colour
0  James    Jamie    Blue
1    Tim            Black
2   John              Red
3  Emily       Em  Purple
lookup_df
    Name  Pet
0    Tim  Cat
1   John  Dog
2     Em  Fox
3  Jamie  Dog
The result I need:
    Name NickName  Colour  Pet
0  James    Jamie    Blue  Dog
1    Tim            Black  Cat
2   John              Red  Dog
3  Emily       Em  Purple  Fox
That is, match on the Name column, and if there is no match, fall back to matching on the NickName column.
I tried many different things, including:
pd.merge(table_df,lookup_df, how='left', left_on='Name', right_on='Name')
if Nan -> pd.merge(table_df,lookup_df, how='left', left_on='NickName', right_on='Name')
but it does not do what I need and I want to avoid having a nested loop.
Has anyone an idea on how to do this? Any feedback is really appreciated.
Thanks!
You can map on Name and fillna on NickName:
s = lookup_df.set_index("Name")["Pet"]
table_df["pet"] = table_df["Name"].map(s).fillna(table_df["NickName"].map(s))
print (table_df)
    Name NickName  Colour  pet
0  James    Jamie    Blue  Dog
1    Tim            Black  Cat
2   John              Red  Dog
3  Emily       Em  Purple  Fox
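If more fallback key columns might show up later, the same map/fillna idea chains; here is a sketch, where the loop and the column list are illustrative and not part of the answer above:

s = lookup_df.set_index('Name')['Pet']

# Try each key column in order; each later .map only fills rows that
# are still missing after the earlier columns (illustrative sketch).
result = table_df['Name'].map(s)
for col in ['NickName']:  # append further fallback columns here
    result = result.fillna(table_df[col].map(s))
table_df['Pet'] = result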

Separate a name into first and last name using Pandas

I have a DataFrame that looks like this:
name                 birth
John Henry Smith     1980
Hannah Gonzalez      1900
Michael Thomas Ford  1950
Michelle Lee         1984
And I want to create two new columns, "middle" and "last" for the middle and last names of each person, respectively. People who have no middle name should have None in that data frame.
This would be my ideal result:
name      middle  last      birth
John      Henry   Smith     1980
Hannah    None    Gonzalez  1900
Michael   Thomas  Ford      1950
Michelle  None    Lee       1984
I have tried different approaches, such as this:
df['middle'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 2 else None)
df['last'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 1 else x.split(" ")[2])
I even made some functions that try to do the same thing more carefully, but I always get the same error: "List Index out of range". This is weird because if I go about printing df.iloc[i,0].split(" ") for i in range(len(df)), I do get lists with length 2 or length 3 only.
I also printed x.count(" ") for all x in the "name" column and I always got either 1 or 2 as a result. There are no single names.
This is my first question so thank you so much!
Use Series.str.split with expand=True.
df2 = (df['name'].str
         .split(' ', expand=True)
         .rename(columns={0: 'name', 1: 'middle', 2: 'last'}))
new_df = df2.assign(middle=df2['middle'].where(df2['last'].notnull()),
                    last=df2['last'].fillna(df2['middle']),
                    birth=df['birth'])
print(new_df)
       name  middle      last  birth
0      John   Henry     Smith   1980
1    Hannah     NaN  Gonzalez   1900
2   Michael  Thomas      Ford   1950
3  Michelle     NaN       Lee   1984
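An alternative sketch using str.extract with a regex; it assumes every name has exactly two or three space-separated parts (which the asker confirmed), and the group names first/middle/last are illustrative:

# Regex assumes 2 or 3 space-separated name parts (illustrative).
parts = df['name'].str.extract(
    r'^(?P<first>\S+)(?:\s+(?P<middle>\S+))?\s+(?P<last>\S+)$')
out = df.assign(name=parts['first'], middle=parts['middle'],
                last=parts['last'])[['name', 'middle', 'last', 'birth']]
print(out)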

Convert dataframe from long to wide with custom column names [duplicate]

I have data in long format and am trying to reshape to wide, but there doesn't seem to be a straightforward way to do this using melt/stack/unstack:
Salesman  Height  product  price
Knut      6       bat      5
Knut      6       ball     1
Knut      6       wand     3
Steve     5       pen      2
Becomes:
Salesman  Height  product_1  price_1  product_2  price_2  product_3  price_3
Knut      6       bat        5        ball       1        wand       3
Steve     5       pen        2        NA         NA       NA         NA
I think Stata can do something like this with the reshape command.
Here's another solution more fleshed out, taken from Chris Albon's site.
Create "long" dataframe
raw_data = {'patient': [1, 1, 1, 2, 2],
            'obs': [1, 2, 3, 1, 2],
            'treatment': [0, 1, 0, 1, 0],
            'score': [6252, 24243, 2345, 2342, 23525]}
df = pd.DataFrame(raw_data, columns=['patient', 'obs', 'treatment', 'score'])
Make a "wide" dataframe:
df.pivot(index='patient', columns='obs', values='score')
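For reference, the pivot above produces the following (the missing obs 3 for patient 2 turns the scores into floats):

obs           1        2       3
patient
1        6252.0  24243.0  2345.0
2        2342.0  23525.0     NaN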
A simple pivot might be sufficient for your needs but this is what I did to reproduce your desired output:
df['idx'] = df.groupby('Salesman').cumcount()
Just adding a within group counter/index will get you most of the way there but the column labels will not be as you desired:
print(df.pivot(index='Salesman', columns='idx')[['product', 'price']])
         product            price
idx            0     1     2     0    1    2
Salesman
Knut         bat  ball  wand     5    1    3
Steve        pen   NaN   NaN     2  NaN  NaN
To get closer to your desired output I added the following:
df['prod_idx'] = 'product_' + df.idx.astype(str)
df['prc_idx'] = 'price_' + df.idx.astype(str)
product = df.pivot(index='Salesman',columns='prod_idx',values='product')
prc = df.pivot(index='Salesman',columns='prc_idx',values='price')
reshape = pd.concat([product,prc],axis=1)
reshape['Height'] = df.set_index('Salesman')['Height'].drop_duplicates()
print(reshape)
         product_0 product_1 product_2  price_0  price_1  price_2  Height
Salesman
Knut           bat      ball      wand        5        1        3       6
Steve          pen       NaN       NaN        2      NaN      NaN       5
Edit: if you want to generalize the procedure to more variables I think you could do something like the following (although it might not be efficient enough):
df['idx'] = df.groupby('Salesman').cumcount()
tmp = []
for var in ['product', 'price']:
    df['tmp_idx'] = var + '_' + df.idx.astype(str)
    tmp.append(df.pivot(index='Salesman', columns='tmp_idx', values=var))
reshape = pd.concat(tmp, axis=1)
#Luke said:
I think Stata can do something like this with the reshape command.
You can but I think you also need a within group counter to get the reshape in stata to get your desired output:
     +-------------------------------------------+
     | salesman   idx   height   product   price |
     |-------------------------------------------|
  1. |     Knut     0        6       bat       5 |
  2. |     Knut     1        6      ball       1 |
  3. |     Knut     2        6      wand       3 |
  4. |    Steve     0        5       pen       2 |
     +-------------------------------------------+
If you add idx then you could do reshape in stata:
reshape wide product price, i(salesman) j(idx)
Karl D's solution gets at the heart of the problem. But I find it's far easier to pivot everything (with .pivot_table because of the two index columns) and then sort and assign the columns to collapse the MultiIndex:
df['idx'] = df.groupby('Salesman').cumcount()+1
df = df.pivot_table(index=['Salesman', 'Height'], columns='idx',
values=['product', 'price'], aggfunc='first')
df = df.sort_index(axis=1, level=1)
df.columns = [f'{x}_{y}' for x,y in df.columns]
df = df.reset_index()
Output:
  Salesman  Height  price_1 product_1  price_2 product_2  price_3 product_3
0     Knut       6      5.0       bat      1.0      ball      3.0      wand
1    Steve       5      2.0       pen      NaN       NaN      NaN       NaN
A bit old but I will post this for other people.
What you want can be achieved, but you probably shouldn't want it ;)
Pandas supports hierarchical indexes for both rows and columns.
In Python 2.7.x ...
from StringIO import StringIO
raw = '''Salesman Height product price
Knut 6 bat 5
Knut 6 ball 1
Knut 6 wand 3
Steve 5 pen 2'''
dff = pd.read_csv(StringIO(raw), sep='\s+')
print dff.set_index(['Salesman', 'Height', 'product']).unstack('product')
Produces a probably more convenient representation than what you were looking for
                 price
product           ball  bat  pen  wand
Salesman Height
Knut     6           1    5  NaN     3
Steve    5         NaN  NaN    2   NaN
The advantage of using set_index and unstack over a single function such as pivot is that you can break the operation down into clear, small steps, which simplifies debugging.
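For newer setups, here are the same steps in Python 3 (io.StringIO replaces StringIO.StringIO, and print becomes a function):

from io import StringIO

import pandas as pd

raw = '''Salesman Height product price
Knut 6 bat 5
Knut 6 ball 1
Knut 6 wand 3
Steve 5 pen 2'''
dff = pd.read_csv(StringIO(raw), sep=r'\s+')
print(dff.set_index(['Salesman', 'Height', 'product']).unstack('product'))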
pivoted = df.pivot(index='Salesman', columns='product', values='price')
pg. 192, Python for Data Analysis
An old question; this is an addition to the already excellent answers. pivot_wider from pyjanitor may be helpful as an abstraction for reshaping from long to wide (it is a wrapper around pd.pivot):
# pip install pyjanitor
import pandas as pd
import janitor
idx = df.groupby(['Salesman', 'Height']).cumcount().add(1)
(df.assign(idx = idx)
.pivot_wider(index = ['Salesman', 'Height'], names_from = 'idx')
)
  Salesman  Height product_1 product_2 product_3  price_1  price_2  price_3
0     Knut       6       bat      ball      wand      5.0      1.0      3.0
1    Steve       5       pen       NaN       NaN      2.0      NaN      NaN
