Why does pandas.groupby keep the key? - python-3.x

I would like to perform the following operations on a dataframe.
import pandas as pd
import datetime
t = pd.DataFrame({'id': [1, 1, 2, 2],
'date': [datetime.date(2020,1,1), datetime.date(2020,1,2)] * 2,
'value': [1, 2, 3, 5]})
t.groupby('id').apply(lambda df: df.set_index('date').diff())
I got the result below
               id  value
id date
1  2020-01-01  NaN    NaN
   2020-01-02  0.0    1.0
2  2020-01-01  NaN    NaN
   2020-01-02  0.0    2.0
My question is why the id column is kept. I expected the 'id' column to disappear after this operation. What I want is
t.set_index(['id', 'date']).groupby(level=0).diff()
Out[92]:
               value
id date
1  2020-01-01    NaN
   2020-01-02    1.0
2  2020-01-01    NaN
   2020-01-02    2.0

One idea is to specify the columns:
df = t.groupby('id')[['date','value']].apply(lambda df: df.set_index('date').diff())
I think the reason is that DataFrame.diff is used inside groupby.apply, so it processes every column of each group, including 'id'.
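A minimal sketch of the column-selection idea, using the question's data. It assumes a pandas version in which each group handed to the function by groupby.apply still contains the 'id' column unless the columns are selected first, which is what the output above suggests. The name result is only illustrative.

```python
import datetime
import pandas as pd

t = pd.DataFrame({'id': [1, 1, 2, 2],
                  'date': [datetime.date(2020, 1, 1), datetime.date(2020, 1, 2)] * 2,
                  'value': [1, 2, 3, 5]})

# Select the non-key columns first, so the frame passed to the lambda
# has no 'id' column for diff() to operate on.
result = t.groupby('id')[['date', 'value']].apply(
    lambda g: g.set_index('date').diff()
)
print(result)
```

The group key still appears as the outer index level, so the result has the same shape as the set_index(['id', 'date']).groupby(level=0).diff() version in the question.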

Related

How to properly create a pandas dataframe with the given data?

I have the following experiment data:
experiment1_device = 'Dev2'
experiment1_values = [1, 2, 3, 4]
experiment1_timestamps = [1, 2, 3, 6]
experiment1_task = 'Oil Level'
experiment1_user = 'Sean'
experiment2_device = 'Dev1'
experiment2_values = [5, 6, 7, 8, 9]
experiment2_timestamps = [1, 2, 3, 4, 6]
experiment2_task = 'Ventilation'
experiment2_user = 'Martin'
experiment3_device = 'Dev1'
experiment3_values = [10, 11, 12, 13, 14]
experiment3_timestamps = [1, 2, 3, 4, 6]
experiment3_task = 'Ventilation'
experiment3_user = 'Sean'
Each experiment consists of:
A user who conducted the experiment
The task the user had to do
The device the user was using
A series of timestamps ...
A series of data values, each read at the corresponding timestamp, so len(experimentx_values) == len(experimentx_timestamps)
The data is currently given in the above format, as single variables so to speak, but I could change this output format if needed, for example if it would be better to put everything in a dict.
The expected output format I would like to achieve is the following:
Timestamp, User and Task should be a MultiIndex, whereas the device is supposed to be the column name, and empty (grey) cells should just contain NaN.
I tried multiple approaches with pd.DataFrame.from_records but couldn't get the desired output format.
Any help is highly appreciated!
Since the data are stored in all of those different variables it will be a lot of writing. However, you should try to store the results of each experiment in a DataFrame (directly from whatever outputs those values), and hold all of those DataFrames in a list which will cut down on the variables you have floating around.
Given your variables, construct DataFrames as follows:
df1 = pd.DataFrame({'Timestamp': experiment1_timestamps,
'User': experiment1_user,
'Task': experiment1_task,
experiment1_device: experiment1_values})
df2 = pd.DataFrame({'Timestamp': experiment2_timestamps,
'User': experiment2_user,
'Task': experiment2_task,
experiment2_device: experiment2_values})
df3 = pd.DataFrame({'Timestamp': experiment3_timestamps,
'User': experiment3_user,
'Task': experiment3_task,
experiment3_device: experiment3_values})
Now join them together into a single DataFrame, setting the index to the columns you want. It looks like your output wants the cartesian product of all possibilities, so we'll reindex to get the fully NaN rows back in:
df = pd.concat([df1, df2, df3]).set_index(['Timestamp', 'User', 'Task']).sort_index()
idx = pd.MultiIndex.from_product([df.index.get_level_values(i).unique()
for i in range(df.index.nlevels)])
df = df.reindex(idx)
                                  Dev2  Dev1
Timestamp User   Task
1         Martin Ventilation       NaN   5.0
                 Oil Level         NaN   NaN
          Sean   Ventilation       NaN  10.0
                 Oil Level         1.0   NaN
2         Martin Ventilation       NaN   6.0
                 Oil Level         NaN   NaN
          Sean   Ventilation       NaN  11.0
                 Oil Level         2.0   NaN
3         Martin Ventilation       NaN   7.0
                 Oil Level         NaN   NaN
          Sean   Ventilation       NaN  12.0
                 Oil Level         3.0   NaN
4         Martin Ventilation       NaN   8.0
                 Oil Level         NaN   NaN
          Sean   Ventilation       NaN  13.0
                 Oil Level         NaN   NaN
6         Martin Ventilation       NaN   9.0
                 Oil Level         NaN   NaN
          Sean   Ventilation       NaN  14.0
                 Oil Level         4.0   NaN
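The advice above about keeping the experiments in a list can be sketched roughly as follows. The experiments list of dicts and its key names are assumptions made for illustration, not a format from the question; the concat, set_index and reindex steps are the same as in the answer.

```python
import pandas as pd

# Hypothetical container: one dict per experiment instead of separate variables.
experiments = [
    {'device': 'Dev2', 'values': [1, 2, 3, 4], 'timestamps': [1, 2, 3, 6],
     'task': 'Oil Level', 'user': 'Sean'},
    {'device': 'Dev1', 'values': [5, 6, 7, 8, 9], 'timestamps': [1, 2, 3, 4, 6],
     'task': 'Ventilation', 'user': 'Martin'},
    {'device': 'Dev1', 'values': [10, 11, 12, 13, 14], 'timestamps': [1, 2, 3, 4, 6],
     'task': 'Ventilation', 'user': 'Sean'},
]

# Build one small frame per experiment; the device name becomes the value column.
frames = [pd.DataFrame({'Timestamp': e['timestamps'],
                        'User': e['user'],
                        'Task': e['task'],
                        e['device']: e['values']})
          for e in experiments]

df = pd.concat(frames).set_index(['Timestamp', 'User', 'Task']).sort_index()

# Reindex on the cartesian product of the level values to add the all-NaN rows.
idx = pd.MultiIndex.from_product([df.index.get_level_values(i).unique()
                                  for i in range(df.index.nlevels)])
df = df.reindex(idx)
print(df)
```

Adding a fourth experiment then only means appending one more dict to the list.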
Maybe there is a simpler way, but you can create sub-dataframes:
import pandas as pd
df1 = pd.DataFrame(data=[[1, 2, 3, 4],[1, 2, 3, 6]]).T
df1.columns = ['values','timestamps']
df1['dev'] = 'Dev2'
df1['user'] = 'Sean'
df1['task'] = 'Oil Level'
df2 = pd.DataFrame(data=[[5, 6, 7, 8, 9],[1, 2, 3, 4, 6]]).T
df2.columns = ['values','timestamps']
df2['dev'] = 'Dev1'
df2['user'] = 'Martin'
df2['task'] = 'Ventilation'
df3 = pd.DataFrame(data=[[10, 11, 12, 13, 14],[1, 2, 3, 4, 6]]).T
df3.columns = ['values','timestamps']
df3['dev'] = 'Dev1'
df3['user'] = 'Sean'
df3['task'] = 'Ventilation'
merge them into one:
df = df1.merge(df2, on=['timestamps','dev','user','task','values'], how='outer')
df = df.merge(df3, on=['timestamps','dev','user','task','values'], how='outer')
and create the aggregation:
piv = df.groupby(['timestamps','user','task','dev']).sum()
piv = piv.unstack() # for the columns dev1, dev2
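Since the three sub-dataframes share the same columns, the outer merges above amount to stacking the rows. As a hedged alternative sketch (not the answerer's code), pd.concat followed by pivot_table gives a similar layout in one step. The frames are rebuilt here from plain dicts so the block runs on its own, and the default mean aggregation is harmless because each combination occurs only once in this data.

```python
import pandas as pd

def make_frame(values, timestamps, dev, user, task):
    # Helper introduced for this sketch only.
    return pd.DataFrame({'values': values, 'timestamps': timestamps,
                         'dev': dev, 'user': user, 'task': task})

df1 = make_frame([1, 2, 3, 4], [1, 2, 3, 6], 'Dev2', 'Sean', 'Oil Level')
df2 = make_frame([5, 6, 7, 8, 9], [1, 2, 3, 4, 6], 'Dev1', 'Martin', 'Ventilation')
df3 = make_frame([10, 11, 12, 13, 14], [1, 2, 3, 4, 6], 'Dev1', 'Sean', 'Ventilation')

# Stack the rows, then spread the devices into columns.
stacked = pd.concat([df1, df2, df3], ignore_index=True)
piv = stacked.pivot_table(index=['timestamps', 'user', 'task'],
                          columns='dev', values='values')
print(piv)
```

Unlike the reindex-based answer, this keeps only the index combinations that actually occur in the data.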

pandas pivot dataframe, but add unseen values

I have 2 dataframes:
purchases = pd.DataFrame([['Alice', 'sweeties', 4],
['Bob', 'chocolate', 5],
['Alice', 'chocolate', 3],
['Claudia', 'juice', 2]],
columns=['client', 'item', 'quantity'])
goods = pd.DataFrame([['sweeties', 15],
['chocolate', 7],
['juice', 8],
['lemons', 3]], columns=['good', 'price'])
and I want to transform purchases so that its columns and index look like those in this picture:
My first thought was to use pivot:
purchases.pivot(columns="item", values="quantity")
Output:
The problem is: I also need the lemons column in the pivot result because it's present in the goods dataframe (just filled with None values).
How can I accomplish that?
You can chain with reindex:
purchases.pivot(columns="item", values="quantity").reindex(goods['good'], axis=1)
Output:
good  sweeties  chocolate  juice  lemons
0          4.0        NaN    NaN     NaN
1          NaN        5.0    NaN     NaN
2          NaN        3.0    NaN     NaN
3          NaN        NaN    2.0     NaN
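The same reindex idea also works when the pivot has a meaningful row index. The sketch below assumes the desired layout has one row per client, which the missing picture may or may not have shown; wide is just an illustrative name.

```python
import pandas as pd

purchases = pd.DataFrame([['Alice', 'sweeties', 4],
                          ['Bob', 'chocolate', 5],
                          ['Alice', 'chocolate', 3],
                          ['Claudia', 'juice', 2]],
                         columns=['client', 'item', 'quantity'])
goods = pd.DataFrame([['sweeties', 15],
                      ['chocolate', 7],
                      ['juice', 8],
                      ['lemons', 3]], columns=['good', 'price'])

# Pivot to one row per client, then force the columns to match the goods list.
# Goods that nobody bought (here 'lemons') appear as all-NaN columns.
wide = (purchases
        .pivot(index='client', columns='item', values='quantity')
        .reindex(columns=goods['good']))
print(wide)
```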
You can use df.merge with df.pivot:
In [3626]: x = goods.merge(purchases, left_on='good', right_on='item', how='left')
In [3628]: x['total'] = x.price * x.quantity # you can tweak this calculation
In [3634]: res = x[['good', 'client', 'total']].pivot(index='client', columns='good', values='total').dropna(how='all').fillna(0)
In [3635]: res
Out[3635]:
good     chocolate  juice  lemons  sweeties
client
Alice         21.0    0.0     0.0      60.0
Bob           35.0    0.0     0.0       0.0
Claudia        0.0   16.0     0.0       0.0

how to select values from multiple columns based on a condition

I have a dataframe which has information about people with balance in their different accounts. It looks something like below.
import pandas as pd
import numpy as np
df = pd.DataFrame({'name':['John', 'Jacob', 'Mary', 'Sue', 'Harry', 'Clara'],
'accnt_1':[2, np.nan, 13, np.nan, np.nan, np.nan],
'accnt_2':[32, np.nan, 12, 21, 32, np.nan],
'accnt_3':[11,21,np.nan,np.nan,2,np.nan]})
df
For each person, I want the balance to be chosen by priority: if accnt_1 is not empty, that is the person's balance. If accnt_1 is empty and accnt_2 is not, the number in accnt_2 is the balance. If both accnt_1 and accnt_2 are empty, whatever is in accnt_3 is the balance.
In the end the output should look like
out_df = pd.DataFrame({'name':['John', 'Jacob', 'Mary', 'Sue', 'Harry', 'Clara'],
'balance':[2, 21, 13, 21, 32, np.nan]})
out_df
I will always know the priority of the columns. I could write a simple function and apply it to this dataframe, but is there a better and faster way to do this with pandas/numpy?
If the balance is the first non-missing value after name, you can convert name to the index, back-fill missing values along each row, and select the first column by position:
df = df.set_index('name').bfill(axis=1).iloc[:, 0].rename('balance').reset_index()
print (df)
    name  balance
0   John      2.0
1  Jacob     21.0
2   Mary     13.0
3    Sue     21.0
4  Harry     32.0
5  Clara      NaN
If you need to specify the column names in priority order, pass them as a list:
cols = ['accnt_1','accnt_2','accnt_3']
df = df.set_index('name')[cols].bfill(axis=1).iloc[:, 0].rename('balance').reset_index()
Or, if you need to keep only the accnt columns, use DataFrame.filter:
df = df.set_index('name').filter(like='accnt').bfill(axis=1).iloc[:, 0].rename('balance').reset_index()
You can simply chain fillna calls to achieve your desired result. The chain reads almost like plain English: "take the values in accnt_1 and fill the missing ones with values from accnt_2. Then, if any NaN remain, fill those with the values from accnt_3."
>>> df["balance"] = df["accnt_1"].fillna(df["accnt_2"]).fillna(df["accnt_3"])
>>> df[["name", "balance"]]
    name  balance
0   John      2.0
1  Jacob     21.0
2   Mary     13.0
3    Sue     21.0
4  Harry     32.0
5  Clara      NaN
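If the number of account columns grows, the fillna chain can be generated from a priority list instead of being written out by hand. This is only a sketch of that generalization: the priority list and the use of functools.reduce are choices made here, not part of the answer above.

```python
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Jacob', 'Mary', 'Sue', 'Harry', 'Clara'],
                   'accnt_1': [2, np.nan, 13, np.nan, np.nan, np.nan],
                   'accnt_2': [32, np.nan, 12, 21, 32, np.nan],
                   'accnt_3': [11, 21, np.nan, np.nan, 2, np.nan]})

# Columns in priority order (highest priority first).
priority = ['accnt_1', 'accnt_2', 'accnt_3']

# Start from the first column and fill remaining gaps from each later column in turn.
balance = reduce(lambda acc, col: acc.fillna(df[col]),
                 priority[1:], df[priority[0]])
df['balance'] = balance
print(df[['name', 'balance']])
```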
df['balance']=df.name.map(df.set_index('name').stack().groupby('name').first())
    name  accnt_1  accnt_2  accnt_3  balance
0   John      2.0     32.0     11.0      2.0
1  Jacob      NaN      NaN     21.0     21.0
2   Mary     13.0     12.0      NaN     13.0
3    Sue      NaN     21.0      NaN     21.0
4  Harry      NaN     32.0      2.0     32.0
5  Clara      NaN      NaN      NaN      NaN
How it works
# Setting name as the index keeps it as an index level after stacking, so you can group by it
df.set_index('name').stack()
name
John   accnt_1     2.0
       accnt_2    32.0
       accnt_3    11.0
Jacob  accnt_3    21.0
Mary   accnt_1    13.0
       accnt_2    12.0
Sue    accnt_2    21.0
Harry  accnt_2    32.0
       accnt_3     2.0
dtype: float64
# Chaining .groupby('name').first() gets you the first non-NaN value per name, because NaN is dropped when you stack
df.set_index('name').stack().groupby('name').first()
#.map() allows you to map output above to the original dataframe
df.name.map(df.set_index('name').stack().groupby('name').first())
0 2.0
1 21.0
2 13.0
3 21.0
4 32.0
5 NaN

Can't assign value to cell in multiindex dataframe (assigning to copy / slice of df?)

I am trying to assign a value (mean of values in another column) to a cell in a multi-index Pandas dataframe over which I iterate to calculate means over a moving window in a different column. But, when I try to assign the value it doesn't change.
I am not used to working with multi-indexes and have solved several other problems but this one has me stumped for now...
Toy code that reproduces the problem:
import numpy as np
import pandas as pd
tuples = [
    ('AFG', 1963), ('AFG', 1964), ('AFG', 1965), ('AFG', 1966), ('AFG', 1967), ('AFG', 1968),
    ('BRA', 1963), ('BRA', 1964), ('BRA', 1965), ('BRA', 1966), ('BRA', 1967), ('BRA', 1968)
]
index = pd.MultiIndex.from_tuples(tuples)
values = [[12, None], [0, None],
          [12, None], [0, 4],
          [12, 5], [0, 4],
          [12, 2], [0, 4],
          [12, 2], [0, 4],
          [1, 4], [7, 1]]
df = pd.DataFrame(values, columns=['Oil', 'Pop'], index=index)
lag = -2
lead = 0
indicator = 'Pop'
new_indicator = 'Mean_pop'
df[new_indicator] = np.nan
df
Gives:
          Oil  Pop  Mean_pop
AFG 1963   12  NaN       NaN
    1964    0  NaN       NaN
    1965   12  NaN       NaN
    1966    0  4.0       NaN
    1967   12  5.0       NaN
    1968    0  4.0       NaN
BRA 1963   12  2.0       NaN
    1964    0  4.0       NaN
    1965   12  2.0       NaN
    1966    0  4.0       NaN
    1967    1  4.0       NaN
    1968    7  1.0       NaN
Then to iterate over the df:
for country, country_df in df.groupby(level=0):
    oldestyear = country_df[indicator].first_valid_index()[1]
    latestyear = country_df[indicator].last_valid_index()[1]
    for t in range(oldestyear, latestyear+1):
        print (country, oldestyear, latestyear, t)
        print (" For", country, ", calculate mean over ", t+lag, "to", t+lead,
               "and add to row for year", t)
        dftt = country_df.loc[(country, t+lag):(country, t+lead)]
        print(dftt[indicator])
        mean = dftt[indicator].mean(axis=0)
        print("mean for ", indicator, "in", country, "during", t+lag, "to", t+lead, "is", mean)
        df.loc[country, t][new_indicator] = mean
Diagnostic output not pasted, but df looks the same after iterating over it and I get the following warning on some iterations:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if sys.path[0] == '':
Any pointers will be greatly appreciated.
I think it is as easy as changing the last line to:
df.loc[(country, t), new_indicator] = mean
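The likely reason this works is that df.loc[country, t][new_indicator] = mean is chained indexing: the first .loc call can return a copy of the row, and the assignment then lands on that copy, which is what the warning in the question points to. Passing the row label (as a tuple) and the column label to a single .loc call assigns into df itself. Below is a minimal sketch on a shortened version of the toy data. The Rolling_pop column is an extra illustration, not part of the answer, and assumes the years within each country are consecutive.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['AFG', 'BRA'], [1966, 1967, 1968]])
df = pd.DataFrame({'Pop': [4.0, 5.0, 4.0, 4.0, 4.0, 1.0]}, index=index)
df['Mean_pop'] = np.nan

# One .loc call with (row tuple, column label) writes into df directly.
window_mean = df.loc[('AFG', 1966):('AFG', 1967), 'Pop'].mean()
df.loc[('AFG', 1967), 'Mean_pop'] = window_mean

# Loop-free variant: a 3-row trailing mean within each country
# (equivalent to lag=-2, lead=0 only when the years are consecutive).
df['Rolling_pop'] = df.groupby(level=0)['Pop'].transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)
print(df)
```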

Pandas, how to dropna values using subset with multiindex dataframe?

I have a data frame with multi-index columns.
From this data frame I need to remove the rows with NaN values in a subset of columns.
I am trying to use the subset option of DataFrame.dropna but I cannot find a way to specify the subset of columns. I have tried using pd.IndexSlice but this does not work.
In the example below I need to get rid of the last row.
import pandas as pd
# ---
a = [1, 1, 2, 2, 3, 3]
b = ["a", "b", "a", "b", "a", "b"]
col = pd.MultiIndex.from_arrays([a[:], b[:]])
val = [
[1, 2, 3, 4, 5, 6],
[None, None, 1, 2, 3, 4],
[None, 1, 2, 3, 4, 5],
[None, None, 5, 3, 3, 2],
[None, None, None, None, 5, 7],
]
# ---
df = pd.DataFrame(val, columns=col)
# ---
print(df)
# ---
idx = pd.IndexSlice
df.dropna(axis=0, how="all", subset=idx[1:2, :])
# ---
print(df)
Using the thresh option is an alternative but if possible I would like to use subset and how='all'
When dealing with a MultiIndex, each column of the MultiIndex can be specified as a tuple:
In [67]: df.dropna(axis=0, how="all", subset=[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])
Out[67]:
     1         2       3
     a    b    a    b  a  b
0  1.0  2.0  3.0  4.0  5  6
1  NaN  NaN  1.0  2.0  3  4
2  NaN  1.0  2.0  3.0  4  5
3  NaN  NaN  5.0  3.0  3  2
Or, to select all columns whose first level equals 1 or 2 you could use:
In [69]: df.dropna(axis=0, how="all", subset=df.loc[[], [1,2]].columns)
Out[69]:
     1         2       3
     a    b    a    b  a  b
0  1.0  2.0  3.0  4.0  5  6
1  NaN  NaN  1.0  2.0  3  4
2  NaN  1.0  2.0  3.0  4  5
3  NaN  NaN  5.0  3.0  3  2
df[[1,2]].columns also works, but this returns a (possibly large) intermediate DataFrame. df.loc[[], [1,2]].columns is more memory-efficient since its intermediate DataFrame is empty.
If you want to apply dropna to the columns that have 1 or 2 in the first level of the column index, you can do it as follows:
cols= [(c0, c1) for (c0, c1) in df.columns if c0 in [1,2]]
df.dropna(axis=0, how="all", subset=cols)
If applied to your data, it results in:
Out[446]:
     1         2       3
     a    b    a    b  a  b
0  1.0  2.0  3.0  4.0  5  6
1  NaN  NaN  1.0  2.0  3  4
2  NaN  1.0  2.0  3.0  4  5
3  NaN  NaN  5.0  3.0  3  2
As you can see, the last row (index=4) is gone, because all columns under 1 and 2 were NaN for this row. If you would rather remove every row where any NaN occurred in those columns, you need:
df.dropna(axis=0, how="any", subset=cols)
Which results in:
Out[447]:
     1         2       3
     a    b    a    b  a  b
0  1.0  2.0  3.0  4.0  5  6
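Another way to build the same subset, sketched here as an alternative to the list comprehension, is to filter the column MultiIndex on its first level with get_level_values and isin. The names mask, cols and result are illustrative, and the result is converted to a plain list of tuples to match the form used in the first answer.

```python
import pandas as pd

col = pd.MultiIndex.from_arrays([[1, 1, 2, 2, 3, 3],
                                 ["a", "b", "a", "b", "a", "b"]])
val = [
    [1, 2, 3, 4, 5, 6],
    [None, None, 1, 2, 3, 4],
    [None, 1, 2, 3, 4, 5],
    [None, None, 5, 3, 3, 2],
    [None, None, None, None, 5, 7],
]
df = pd.DataFrame(val, columns=col)

# Boolean mask over the columns: True where the first level is 1 or 2.
mask = df.columns.get_level_values(0).isin([1, 2])
cols = df.columns[mask].tolist()  # list of (level0, level1) tuples

# dropna returns a new frame; assign it to keep the result.
result = df.dropna(axis=0, how="all", subset=cols)
print(result)
```

Note that dropna does not modify df in place by default, which is why the second print(df) in the question would show the original frame even with a valid subset.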
