pandas pivot dataframe, but add unseen values - python-3.x

I have 2 dataframes:
import pandas as pd

purchases = pd.DataFrame([['Alice', 'sweeties', 4],
                          ['Bob', 'chocolate', 5],
                          ['Alice', 'chocolate', 3],
                          ['Claudia', 'juice', 2]],
                         columns=['client', 'item', 'quantity'])
goods = pd.DataFrame([['sweeties', 15],
                      ['chocolate', 7],
                      ['juice', 8],
                      ['lemons', 3]], columns=['good', 'price'])
and I want to transform purchases so that its columns and index look like in this photo:
My first thought was to use pivot:
purchases.pivot(columns="item", values="quantity")
Output:
The problem is: I also need the lemons column in the pivot result because it's present in the goods dataframe (just filled with None values).
How can I accomplish that?

You can chain with reindex:
purchases.pivot(columns="item", values="quantity").reindex(goods['good'], axis=1)
Output:
good sweeties chocolate juice lemons
0 4.0 NaN NaN NaN
1 NaN 5.0 NaN NaN
2 NaN 3.0 NaN NaN
3 NaN NaN 2.0 NaN
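If you also want one row per client instead of the default RangeIndex, a sketch (assuming each client/item pair occurs only once in purchases; otherwise pivot raises and pivot_table would be needed):
purchases.pivot(index="client", columns="item", values="quantity").reindex(goods['good'], axis=1)
# good     sweeties  chocolate  juice  lemons
# client
# Alice         4.0        3.0    NaN     NaN
# Bob           NaN        5.0    NaN     NaN
# Claudia       NaN        NaN    2.0     NaN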

You can use df.merge with df.pivot:
In [3626]: x = goods.merge(purchases, left_on='good', right_on='item', how='left')
In [3628]: x['total'] = x.price * x.quantity # you can tweak this calculation
In [3634]: res = x[['good', 'client', 'total']].pivot(index='client', columns='good', values='total').dropna(how='all').fillna(0)
In [3635]: res
Out[3635]:
good chocolate juice lemons sweeties
client
Alice 21.0 0.0 0.0 60.0
Bob 35.0 0.0 0.0 0.0
Claudia 0.0 16.0 0.0 0.0

Related

Why does pandas.groupby keep the key?

I would like to perform the following operations on a dataframe.
import pandas as pd
import datetime
t = pd.DataFrame({'id': [1, 1, 2, 2],
'date': [datetime.date(2020,1,1), datetime.date(2020,1,2)] * 2,
'value': [1, 2, 3, 5]})
t.groupby('id').apply(lambda df: df.set_index('date').diff())
I got the result below
id value
id date
1 2020-01-01 NaN NaN
2020-01-02 0.0 1.0
2 2020-01-01 NaN NaN
2020-01-02 0.0 2.0
My question is: why is the id column kept? I expected the 'id' column to disappear after this operation. What I want is
t.set_index(['id', 'date']).groupby(level=0).diff()
Out[92]:
value
id date
1 2020-01-01 NaN
2020-01-02 1.0
2 2020-01-01 NaN
2020-01-02 2.0
One idea is to specify the columns explicitly:
df = t.groupby('id')[['date','value']].apply(lambda df: df.set_index('date').diff())
I think the reason is that DataFrame.diff is used on the whole group inside groupby.apply, so it processes all columns of each group, including id.
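If you prefer to keep the original lambda, a sketch of an equivalent workaround is to simply drop the redundant column afterwards:
t.groupby('id').apply(lambda df: df.set_index('date').diff()).drop(columns='id')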

How to properly create a pandas dataframe with the given data?

I have the following experiment data:
experiment1_device = 'Dev2'
experiment1_values = [1, 2, 3, 4]
experiment1_timestamps = [1, 2, 3, 6]
experiment1_task = 'Oil Level'
experiment1_user = 'Sean'
experiment2_device = 'Dev1'
experiment2_values = [5, 6, 7, 8, 9]
experiment2_timestamps = [1, 2, 3, 4, 6]
experiment2_task = 'Ventilation'
experiment2_user = 'Martin'
experiment3_device = 'Dev1'
experiment3_values = [10, 11, 12, 13, 14]
experiment3_timestamps = [1, 2, 3, 4, 6]
experiment3_task = 'Ventilation'
experiment3_user = 'Sean'
Each experiment consists of:
A user who conducted the experiment
The task the user had to do
The device the user was using
A series of timestamps ...
and a series of data values, each read at a certain timestamp, so len(experimentx_values) == len(experimentx_timestamps)
The data is currently given in the above format, i.e. as individual variables, but I could change this format if needed, for example by putting everything in a dict if that would be better.
The expected output format I would like to achieve is the following:
Timestamp, User and Task should be a MultiIndex, whereas the device is supposed to be the column name, and empty (grey) cells should just contain NaN.
I tried multiple approaches with pd.DataFrame.from_records but couldn't get the desired output format.
Any help is highly appreciated!
Since the data are stored in all of those different variables, it will be a lot of writing. However, you should try to store the results of each experiment in a DataFrame (directly from whatever produces those values) and hold all of those DataFrames in a list, which will cut down on the variables you have floating around.
Given your variables, construct DataFrames as follows:
df1 = pd.DataFrame({'Timestamp': experiment1_timestamps,
'User': experiment1_user,
'Task': experiment1_task,
experiment1_device: experiment1_values})
df2 = pd.DataFrame({'Timestamp': experiment2_timestamps,
'User': experiment2_user,
'Task': experiment2_task,
experiment2_device: experiment2_values})
df3 = pd.DataFrame({'Timestamp': experiment3_timestamps,
'User': experiment3_user,
'Task': experiment3_task,
experiment3_device: experiment3_values})
Now join them together into a single DataFrame, setting the index to the columns you want. It looks like your output wants the cartesian product of all possibilities, so we'll reindex to get the fully NaN rows back in:
df = pd.concat([df1, df2, df3]).set_index(['Timestamp', 'User', 'Task']).sort_index()
idx = pd.MultiIndex.from_product([df.index.get_level_values(i).unique()
for i in range(df.index.nlevels)])
df = df.reindex(idx)
Dev2 Dev1
Timestamp User Task
1 Martin Ventilation NaN 5.0
Oil Level NaN NaN
Sean Ventilation NaN 10.0
Oil Level 1.0 NaN
2 Martin Ventilation NaN 6.0
Oil Level NaN NaN
Sean Ventilation NaN 11.0
Oil Level 2.0 NaN
3 Martin Ventilation NaN 7.0
Oil Level NaN NaN
Sean Ventilation NaN 12.0
Oil Level 3.0 NaN
4 Martin Ventilation NaN 8.0
Oil Level NaN NaN
Sean Ventilation NaN 13.0
Oil Level NaN NaN
6 Martin Ventilation NaN 9.0
Oil Level NaN NaN
Sean Ventilation NaN 14.0
Oil Level 4.0 NaN
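With the MultiIndex in place, cross-sections are straightforward; for example, pulling out all of Sean's readings (a sketch):
df.xs('Sean', level='User')
# rows for every Timestamp/Task combination belonging to Sean, with Dev2/Dev1 as columns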
Maybe there is a simpler way, but you can create sub-dataframes:
import pandas as pd
df1 = pd.DataFrame(data=[[1, 2, 3, 4],[1, 2, 3, 6]]).T
df1.columns = ['values','timestamps']
df1['dev'] = 'Dev2'
df1['user'] = 'Sean'
df1['task'] = 'Oil Level'
df2 = pd.DataFrame(data=[[5, 6, 7, 8, 9],[1, 2, 3, 4, 6]]).T
df2.columns = ['values','timestamps']
df2['dev'] = 'Dev1'
df2['user'] = 'Martin'
df2['task'] = 'Ventilation'
df3 = pd.DataFrame(data=[[10, 11, 12, 13, 14],[1, 2, 3, 4, 6]]).T
df3.columns = ['values','timestamps']
df3['dev'] = 'Dev1'
df3['user'] = 'Sean'
df3['task'] = 'Ventilation'
merge them into one:
df = df1.merge(df2, on=['timestamps','dev','user','task','values'], how='outer')
df = df.merge(df3, on=['timestamps','dev','user','task','values'], how='outer')
and create the aggregation:
piv = df.groupby(['timestamps','user','task','dev']).sum()
piv = piv.unstack() # for the columns dev1, dev2
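Note that after the groupby sum only the values column is left, so unstack() produces a column MultiIndex like ('values', 'Dev1') / ('values', 'Dev2'); if you prefer plain Dev1/Dev2 column names, a sketch:
piv.columns = piv.columns.droplevel(0)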

how to select values from multiple columns based on a condition

I have a dataframe which has information about people with balance in their different accounts. It looks something like below.
import pandas as pd
import numpy as np
df = pd.DataFrame({'name':['John', 'Jacob', 'Mary', 'Sue', 'Harry', 'Clara'],
'accnt_1':[2, np.nan, 13, np.nan, np.nan, np.nan],
'accnt_2':[32, np.nan, 12, 21, 32, np.nan],
'accnt_3':[11,21,np.nan,np.nan,2,np.nan]})
df
I want to get the balance for each person: if accnt_1 is not empty, that is the person's balance. If accnt_1 is empty and accnt_2 is not, the number in accnt_2 is the balance. If both accnt_1 and accnt_2 are empty, whatever is in accnt_3 is the balance.
In the end the output should look like
out_df = pd.DataFrame({'name':['John', 'Jacob', 'Mary', 'Sue', 'Harry', 'Clara'],
'balance':[2, 21, 13, 21, 32, np.nan]})
out_df
I will always know the priority of the columns. I could write a simple function and apply it to this dataframe, but is there a better and faster way to do this using pandas/numpy?
If balance means the first non-missing value after name, you can set name as the index, back-fill the missing values along the rows, and select the first column by position:
df = df.set_index('name').bfill(axis=1).iloc[:, 0].rename('balance').reset_index()
print (df)
name balance
0 John 2.0
1 Jacob 21.0
2 Mary 13.0
3 Sue 21.0
4 Harry 32.0
5 Clara NaN
If you need to specify the column names in order, use a list:
cols = ['accnt_1','accnt_2','accnt_3']
df = df.set_index('name')[cols].bfill(axis=1).iloc[:, 0].rename('balance').reset_index()
Or, if you need to filter only the accnt columns, use DataFrame.filter:
df = df.set_index('name').filter(like='accnt').bfill(axis=1).iloc[:, 0].rename('balance').reset_index()
You can simply chain fillna methods onto each other to achieve your desired result. The chaining can be read in plain English roughly as: "take the values in accnt_1 and fill the missing values with values from accnt_2; then, if there are still NaN remaining, fill those with the values from accnt_3".
>>> df["balance"] = df["accnt_1"].fillna(df["accnt_2"]).fillna(df["accnt_3"])
>>> df[["name", "balance"]]
name balance
0 John 2.0
1 Jacob 21.0
2 Mary 13.0
3 Sue 21.0
4 Harry 32.0
5 Clara NaN
df['balance']=df.name.map(df.set_index('name').stack().groupby('name').first())
name accnt_1 accnt_2 accnt_3 balance
0 John 2.0 32.0 11.0 2.0
1 Jacob NaN NaN 21.0 21.0
2 Mary 13.0 12.0 NaN 13.0
3 Sue NaN 21.0 NaN 21.0
4 Harry NaN 32.0 2.0 32.0
5 Clara NaN NaN NaN NaN
How it works
# Setting name as the index means it becomes a level of the MultiIndex after stacking, so you can group by it
df.set_index('name').stack().groupby('name').first()
name
John accnt_1 2.0
accnt_2 32.0
accnt_3 11.0
Jacob accnt_3 21.0
Mary accnt_1 13.0
accnt_2 12.0
Sue accnt_2 21.0
Harry accnt_2 32.0
accnt_3 2.0
dtype: float64
# Chaining .first() gets you the first non-NaN value for each name, because stack() drops NaN by default
df.set_index('name').stack().groupby('name').first()
# .map() maps the output above back onto the original dataframe via the name column
df.name.map(df.set_index('name').stack().groupby('name').first())
0 2.0
1 21.0
2 13.0
3 21.0
4 32.0
5 NaN

Creating a column in one dataframe from another dataframe doesn't transfer missing rows

I have the following two dataframes:
data = {'Name': ['Tom', 'Jack', 'nick', 'juli'], 'marks': [99, 98, 95, 90]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
data = {'salata': ['ntomata', 'tzatziki']}
df2 = pd.DataFrame(data, index=['rank3', 'rank5'])
What I want is to copy the salata column from df2 to df.
df['salata'] = df2['salata']
However, it doesn't copy the missing row rank5 to df.
Update: Thank you for the answers.
What should I use in case the dataframes have different column multiindex levels?
For example:
data = {('Name','Here'): ['Tom', 'Jack', 'nick', 'juli'], ('marks','There'): [99, 98, 95, 90]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
df[('salata','-')] = df2['salata']
Use DataFrame.combine_first:
#all columns
df = df.combine_first(df2)
#only columns in list
#df = df.combine_first(df2[['salata']])
print (df)
Name marks salata
rank1 Tom 99.0 NaN
rank2 Jack 98.0 NaN
rank3 nick 95.0 ntomata
rank4 juli 90.0 NaN
rank5 NaN NaN tzatziki
EDIT:
If the columns are a MultiIndex, first create a MultiIndex in df2 as well, e.g. with MultiIndex.from_product:
df2.columns = pd.MultiIndex.from_product([[''], df2.columns])
df = df.combine_first(df2)
print (df)
Name marks
salata Here There
rank1 NaN Tom 99.0
rank2 NaN Jack 98.0
rank3 ntomata nick 95.0
rank4 NaN juli 90.0
rank5 tzatziki NaN NaN
Another solution with concat:
df = pd.concat([df, df2], axis=1)
If your indexes are representative of your example, then you can do an outer join:
df = df.join(df2,how='outer')
Name marks salata
rank1 Tom 99.0 NaN
rank2 Jack 98.0 NaN
rank3 nick 95.0 ntomata
rank4 juli 90.0 NaN
rank5 NaN NaN tzatziki

Pandas pivot with multiple items per column, how to avoid aggregating them?

Follow up to this question, in particular this comment.
Consider following dataframe:
df = pd.DataFrame({
'Person': ['Adam', 'Adam', 'Cesar', 'Diana', 'Diana', 'Diana', 'Erika', 'Erika'],
'Belonging': ['House', 'Car', 'Car', 'House', 'Car', 'Bike', 'House', 'Car'],
'Value': [300.0, 10.0, 12.0, 450.0, 15.0, 2.0, 600.0, 11.0],
})
Which looks like this:
Person Belonging Value
0 Adam House 300.0
1 Adam Car 10.0
2 Cesar Car 12.0
3 Diana House 450.0
4 Diana Car 15.0
5 Diana Bike 2.0
6 Erika House 600.0
7 Erika Car 11.0
Using a pivot_table() is a nice way to reshape this data: it allows querying by Person and seeing all of their belongings in a single row, making it really easy to answer queries such as "What is the Value of a Person's Car, if they have a House valued at more than 400.0?"
A pivot_table() can be easily built for this data set with:
df_pivot = df.pivot_table(
values='Value',
index='Person',
columns='Belonging',
)
Which will look like:
Belonging Bike Car House
Person
Adam NaN 10.0 300.0
Cesar NaN 12.0 NaN
Diana 2.0 15.0 450.0
Erika NaN 11.0 600.0
But this gets limited when a Person has more than one of the same type of Belonging, for example two Cars, two Houses or two Bikes.
Consider the updated data:
df = pd.DataFrame({
'Person': ['Adam', 'Adam', 'Cesar', 'Diana', 'Diana', 'Diana', 'Erika', 'Erika', 'Diana', 'Adam'],
'Belonging': ['House', 'Car', 'Car', 'House', 'Car', 'Bike', 'House', 'Car', 'Car', 'House'],
'Value': [300.0, 10.0, 12.0, 450.0, 15.0, 2.0, 600.0, 11.0, 21.0, 180.0],
})
Which looks like:
Person Belonging Value
0 Adam House 300.0
1 Adam Car 10.0
2 Cesar Car 12.0
3 Diana House 450.0
4 Diana Car 15.0
5 Diana Bike 2.0
6 Erika House 600.0
7 Erika Car 11.0
8 Diana Car 21.0
9 Adam House 180.0
Now that same pivot_table() will return the average of Diana's two cars, or Adam's two houses:
Belonging Bike Car House
Person
Adam NaN 10.0 240.0
Cesar NaN 12.0 NaN
Diana 2.0 18.0 450.0
Erika NaN 11.0 600.0
So we can pass pivot_table() an aggfunc='sum' or aggfunc=np.sum to get the sum rather than the average, which will give us 480.0 and 36.0 and is probably a better representation of the total value a Person owns in Belongings of a certain type. But we're missing details.
We can use aggfunc=list which will preserve them:
df_pivot = df.pivot_table(
values='Value',
index='Person',
columns='Belonging',
aggfunc=list,
)
Belonging Bike Car House
Person
Adam NaN [10.0] [300.0, 180.0]
Cesar NaN [12.0] NaN
Diana [2.0] [15.0, 21.0] [450.0]
Erika NaN [11.0] [600.0]
This keeps the detail about multiple Belongings per Person, but on the other hand is quite inconvenient in that it is using Python lists rather than native Pandas types and columns, so it makes some queries such as the total Values in Houses difficult to answer.
Using aggfunc=np.sum, we could simply use df_pivot['House'].sum() to get the total of 1530.0; with lists instead, even questions such as the one above (Cars for Persons with a House worth more than 400.0) become harder to answer.
What's a better way to reshape this data that will:
Allow easy querying a Person's Belongings in a single row, like the pivot_table() does;
Preserve the details of Persons who have multiple Belongings of a certain type;
Use native Pandas columns and data types that make it possible to use Pandas methods for querying and summarizing the data.
I thought of updating the Belonging descriptions to include a counter, such as "House 1", "Car 2", etc. Perhaps sorting so that the most valuable one comes first (to help answer questions such as "has a house worth more than 400.0" looking at "House 1" only.)
Or perhaps using a pd.MultiIndex to still be able to access all "House" columns together.
But unsure how to actually reshape the data in such a way.
Or are there better suggestions on how to reshape it (other than adding a count per belonging) that would preserve the features described above? How would you reshape it and how would you answer all these queries I mentioned above?
Perhaps something like this:
Given your pivot table in the following dataframe:
pv = df_pivot = df.pivot_table(
values='Value',
index='Person',
columns='Belonging',
aggfunc=list,
)
then apply pd.Series to all columns. For proper naming of the columns, calculate the maximum list length in each column and then use set_axis for renaming:
new_pv = pd.DataFrame(index=pv.index)
for col in pv:
    n = int(pv[col].str.len().max())
    new_pv = pd.concat([new_pv, pv[col].apply(pd.Series).set_axis([f'{col}_{i}' for i in range(n)], axis=1)], axis=1)
# Bike_0 Car_0 Car_1 House_0 House_1
# Person
# Adam NaN 10.0 NaN 300.0 180.0
# Cesar NaN 12.0 NaN NaN NaN
# Diana 2.0 15.0 21.0 450.0 NaN
# Erika NaN 11.0 NaN 600.0 NaN
counting of houses:
new_pv.filter(like='House').count(1)
# Person
# Adam 2
# Cesar 0
# Diana 1
# Erika 1
# dtype: int64
sum of all houses' values:
new_pv.filter(like='House').sum().sum()
# 1530.0
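The same filter pattern works for the other belongings as well, e.g. counting cars per person (a sketch on the new_pv frame above):
new_pv.filter(like='Car').count(1)
# Person
# Adam     1
# Cesar    1
# Diana    2
# Erika    1
# dtype: int64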
Using groupby, you could achieve something like this.
df_new = df.groupby(['Person', 'Belonging']).agg(['sum', 'count', 'min', 'max'])
which would give:
Value
sum count min max
Person Belonging
Adam Car 10.0 1 10.0 10.0
House 480.0 2 180.0 300.0
Cesar Car 12.0 1 12.0 12.0
Diana Bike 2.0 1 2.0 2.0
Car 36.0 2 15.0 21.0
House 450.0 1 450.0 450.0
Erika Car 11.0 1 11.0 11.0
House 600.0 1 600.0 600.0
You could also define your own functions in the .agg call to provide more suitable summaries.
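For example, a hypothetical value_range aggregation (the name and function are just an illustration) could be passed alongside the built-ins, as a sketch:
def value_range(s):
    # spread between the most and least valuable belonging of this type
    return s.max() - s.min()

df.groupby(['Person', 'Belonging']).agg(['sum', 'count', value_range])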
Edit
Alternatively, you could try
df['Belonging'] = df["Belonging"] + "_" + df.groupby(['Person','Belonging']).cumcount().add(1).astype(str)
Person Belonging Value
0 Adam House_1 300.0
1 Adam Car_1 10.0
2 Cesar Car_1 12.0
3 Diana House_1 450.0
4 Diana Car_1 15.0
5 Diana Bike_1 2.0
6 Erika House_1 600.0
7 Erika Car_1 11.0
8 Diana Car_2 21.0
9 Adam House_2 180.0
Then you can just use pivot
df.pivot(index='Person', columns='Belonging')
Value
Belonging Bike_1 Car_1 Car_2 House_1 House_2
Person
Adam NaN 10.0 NaN 300.0 180.0
Cesar NaN 12.0 NaN NaN NaN
Diana 2.0 15.0 21.0 450.0 NaN
Erika NaN 11.0 NaN 600.0 NaN
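With that layout, the earlier query (Cars of Persons who have a House worth more than 400.0) could be answered roughly as follows (a sketch; it checks every House_N column, since the numbering follows insertion order rather than value):
piv = df.pivot(index='Person', columns='Belonging')['Value']
mask = piv.filter(like='House').gt(400).any(axis=1)
piv.loc[mask, ['Car_1', 'Car_2']]
# Belonging  Car_1  Car_2
# Person
# Diana       15.0   21.0
# Erika       11.0    NaN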
I ended up working out a solution to this one, inspired by the excellent answers by @SpghttCd and @Josmoor98, but with a couple of differences:
Using a MultiIndex, so I have a really easy way to get all Houses or all Cars.
Sorting values, so looking at the first House or Car can be used to tell who has one worth more than X.
Code for the pivot table:
df_pivot = (df
.assign(BelongingNo=df
.sort_values(by='Value', ascending=False)
.groupby(['Person', 'Belonging'])
.cumcount() + 1
)
.pivot_table(
values='Value',
index='Person',
columns=['Belonging', 'BelongingNo'],
)
)
Resulting DataFrame:
Belonging Bike Car House
BelongingNo 1 1 2 1 2
Person
Adam NaN 10.0 NaN 300.0 180.0
Cesar NaN 12.0 NaN NaN NaN
Diana 2.0 21.0 15.0 450.0 NaN
Erika NaN 11.0 NaN 600.0 NaN
Queries are pretty straightforward.
For example, finding the Value of Person's Cars, if they have a House valued more than 400.0:
df_pivot.loc[
df_pivot[('House', 1)] > 400.0,
'Car'
]
Result:
BelongingNo 1 2
Person
Diana 21.0 15.0
Erika 11.0 NaN
The average Car price for them:
df_pivot.loc[
df_pivot[('House', 1)] > 400.0,
'Car'
].stack().mean()
Result: 15.6666
Here, using stack() is a powerful way to flatten the second level of the MultiIndex, after having used the top-level to select a Belonging column.
Same is useful to get the total Value of all Houses:
df_pivot['House'].sum()
Results in the expected 1530.0.
Finally, looking at all Belongings of a single Person:
df_pivot.loc['Adam'].dropna()
Returns the expected two Houses and the one Car, with their respective Values.
I tried doing this with the lists in the dataframe, converting them to NumPy ndarrays:
import numpy as np

# df_pivot is the aggfunc=list pivot table from above
pd_df_pivot = df_pivot.copy(deep=True)
for row in range(0, df_pivot.shape[0]):
    for col in range(0, df_pivot.shape[1]):
        # replace list cells with ndarrays, leave scalar (NaN) cells untouched
        if type(df_pivot.iloc[row, col]) is list:
            pd_df_pivot.iloc[row, col] = np.array(df_pivot.iloc[row, col])
        else:
            pd_df_pivot.iloc[row, col] = df_pivot.iloc[row, col]
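With the cells stored as ndarrays you can still aggregate them, for example summing each person's houses (a sketch, assuming the loop above has run):
pd_df_pivot['House'].apply(lambda a: a.sum() if isinstance(a, np.ndarray) else a)
# Person
# Adam     480.0
# Cesar      NaN
# Diana    450.0
# Erika    600.0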
