The problem is to compute the column "accumulated" of a dataframe from the columns "accumulated" and "weekly". The formula is: accumulated at t = weekly at t + accumulated at t-1.
The desired result should be:
weekly accumulated
2 0
1 1
4 5
2 7
The result I'm obtaining is:
weekly accumulated
2 0
1 1
4 4
2 2
What I have tried is:
for key, value in df_dic.items():
    df_aux = df_dic[key]
    df_aux['accumulated'] = 0
    df_aux['accumulated'] = (df_aux.weekly + df_aux.accumulated.shift(1))
    #df_aux["accumulated"] = df_aux.iloc[:,2] + df_aux.iloc[:,3].shift(1)
    df_aux.iloc[0,3] = 0 #I put this because I want to force the first cell to be 0.
Here df_aux.iloc[0,3] is the first row of the column "accumulated".
What am I doing wrong?
Thank you
EDIT: df_dic is a dictionary with 5 dataframes. It looks like {0: df1, 1:df2, 2:df3}. All the dataframes have the same size and the same column names, so I use the for loop to do the same calculation on every dataframe in the dictionary.
EDIT2: I tried doing the computation outside the for loop and it is still not working.
What I'm doing is:
df_auxp = df_dic[0]
df_auxp['accumulated'] = 0
df_auxp['accumulated'] = df_auxp["weekly"] + df_auxp["accumulated"].shift(1)
df_auxp.iloc[0,3] = df_auxp.iloc[0,3].fillna(0)
Maybe it has something to do with the dictionary interaction...
To solve for 3 dataframes
import pandas as pd
df1 = pd.DataFrame({'weekly':[2,1,4,2]})
df2 = pd.DataFrame({'weekly':[3,2,5,3]})
df3 = pd.DataFrame({'weekly':[4,3,6,4]})
print (df1)
print (df2)
print (df3)
for d in [df1,df2,df3]:
    d['accumulated'] = d['weekly'].cumsum() - d.iloc[0,0]
    print (d)
The output of this will be as follows:
Original dataframes:
df1
weekly
0 2
1 1
2 4
3 2
df2
weekly
0 3
1 2
2 5
3 3
df3
weekly
0 4
1 3
2 6
3 4
Updated dataframes:
df1:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
df2:
weekly accumulated
0 3 0
1 2 2
2 5 7
3 3 10
df3:
weekly accumulated
0 4 0
1 3 3
2 6 9
3 4 13
To solve for 1 dataframe
You need to use cumsum and then subtract the value from the first row. That will give you the desired result. Here's how to do it.
import pandas as pd
df = pd.DataFrame({'weekly':[2,1,4,2]})
print (df)
df['accumulated'] = df['weekly'].cumsum() - df.iloc[0,0]
print (df)
Original dataframe:
weekly
0 2
1 1
2 4
3 2
Updated dataframe:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
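Since the question keeps its dataframes in a dictionary, the same cumsum idea can be applied while looping over it. The following is a minimal sketch, assuming a df_dic shaped like the one in the question (the two sample frames are made up for illustration). It also reproduces why the shift attempt fails: shift is applied once to a column that is all zeros, so nothing is carried forward from row to row.

```python
import pandas as pd

# Hypothetical stand-in for the df_dic from the question.
df_dic = {
    0: pd.DataFrame({'weekly': [2, 1, 4, 2]}),
    1: pd.DataFrame({'weekly': [3, 2, 5, 3]}),
}

# Why the shift attempt fails: the column being shifted is all zeros,
# so weekly + shifted zeros just gives weekly back (NaN in the first row).
zeros = pd.Series(0, index=df_dic[0].index)
shift_attempt = df_dic[0]['weekly'] + zeros.shift(1)
print(shift_attempt)

for key, df_aux in df_dic.items():
    # Running total of weekly, offset so that the first row is 0.
    df_aux['accumulated'] = df_aux['weekly'].cumsum() - df_aux['weekly'].iloc[0]
    print(df_aux)
```

Because df_aux refers to the same object that is stored in the dictionary, the new column is visible through df_dic[key] afterwards.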
I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=False)["diff"].min()
df1 = df.groupby("item", as_index=False)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=False)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
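To make the difference in the resulting indices concrete, here is a small self-contained sketch that rebuilds the sample data from the question and runs both methods side by side. The variable names by_idxmin and by_first are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'item':       [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'diff':       [2, 1, 3, -1, 1, 4, -6, 0, 2],
    'otherstuff': [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# Method #1: row labels of the minimum diff per item, then select those rows.
by_idxmin = df.loc[df.groupby('item')['diff'].idxmin()]

# Method #2: sort by diff, then take the first row of each item group.
by_first = df.sort_values('diff').groupby('item', as_index=False).first()

print(by_idxmin.index.tolist())  # original row labels are kept
print(by_first.index.tolist())   # a fresh 0..n-1 index
```

If the original row labels matter later (for example to join back to the source data), Method #1 is probably the more convenient of the two.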
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
If there may be multiple minimal values per group and you want all of the minimal rows, use boolean indexing with transform to get the minimal value per group:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
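As a runnable version of the above, the sketch below rebuilds the data with the tie in item 1 and contrasts the transform mask with idxmin, which only ever returns one row per group. The names group_min, all_min_rows and one_per_group are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'item':       [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'diff':       [2, 1, 1, -1, 1, 4, -6, 0, 2],
    'otherstuff': [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# Minimum diff of each item, broadcast back to every row of that item.
group_min = df.groupby('item')['diff'].transform('min')

# Keeps every row that ties for the minimum.
all_min_rows = df[df['diff'] == group_min]

# Keeps only the first minimal row per item.
one_per_group = df.loc[df.groupby('item')['diff'].idxmin()]

print(all_min_rows)
print(one_per_group)
```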
The above answer works great if there is only one min, or you only want one. In my case there could be multiple mins and I wanted all rows equal to the min, which .idxmin() doesn't give you. This worked:
def filter_group(dfg, col):
    return dfg[dfg[col] == dfg[col].min()]
df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g',group_keys=False).apply(lambda x: filter_group(x,'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.
df.sort_values(by='item', inplace=True, ignore_index=True)
df.drop_duplicates(subset='diff', inplace=True, ignore_index=True)
df.sort_values(by=['diff'], inplace=True, ignore_index=True)
For a little more explanation:
Sort items by the minimum value you want
Drop the duplicates of the column you want to sort with
Re-sort the data, because it is still sorted by the minimum values
If you know that all of your "items" have more than one record you can sort, then use duplicated:
df.sort_values(by='diff').duplicated(subset='item', keep='first')
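Note that duplicated returns a boolean mask rather than the filtered rows: it is True for every row that is not the first occurrence of its item in the sorted order. A minimal sketch of how the mask could be used to keep only the minimal rows, assuming the sample data from the question (the names sorted_df, is_repeat and min_rows are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'item':       [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'diff':       [2, 1, 3, -1, 1, 4, -6, 0, 2],
    'otherstuff': [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

sorted_df = df.sort_values(by='diff')

# True for every row of an item except the one with the smallest diff.
is_repeat = sorted_df.duplicated(subset='item', keep='first')

# Invert the mask to keep the minimal rows, then restore the original order.
min_rows = sorted_df[~is_repeat].sort_index()
print(min_rows)
```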
I have two dataframes like given below.
df1 = pd.DataFrame({'a':[1,1,2,2,3,3], 'b':[1,2,1,2,1,2], 'c':[1,2,4,0,0,2]})
df1
a b c
0 1 1 1
1 1 2 2
2 2 1 4
3 2 2 0
4 3 1 0
5 3 2 2
df2 = pd.DataFrame({'a':[1,1,2,2], 'b':[1,2,1,2], 'c':[1,5,6,2]})
df2
a b c
0 1 1 1
1 1 2 5
2 2 1 6
3 2 2 2
I want to inner join both dataframes but don't want the columns from df2, so I tried the code below.
merged_df = df1.merge(df2, how='inner', left_on=["a", "b"], right_on=["a","b"])
a b c_x c_y
0 1 1 1 1
1 1 2 2 5
2 2 1 4 6
3 2 2 0 2
Starting from the above code, and without dropping c_x and c_y manually, is there any way to avoid merging in the columns of the right dataframe (df2)?
Basically, I want all the columns from df1 and no columns from df2 after merging.
Thanks in advance.
The idea is to filter df2 down to only the columns used for merging, here a and b. If you want to merge on both columns, the on parameter can be omitted (pandas then merges on the intersection of the columns in both DataFrames):
merged_df = df1.merge(df2[["a", "b"]])
This works the same as:
merged_df = df1.merge(df2[["a", "b"]], on=['a','b'])
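A self-contained sketch of this approach on the sample data follows. One assumption is added that the answer does not state: if df2 could contain the same (a, b) pair more than once, the merge would repeat the matching df1 rows, so the key columns are de-duplicated first. The name keys is only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 1, 2, 2, 3, 3],
                    'b': [1, 2, 1, 2, 1, 2],
                    'c': [1, 2, 4, 0, 0, 2]})
df2 = pd.DataFrame({'a': [1, 1, 2, 2],
                    'b': [1, 2, 1, 2],
                    'c': [1, 5, 6, 2]})

# Keep only the key columns of df2 and de-duplicate them, so that the
# merge acts as a pure row filter on df1.
keys = df2[['a', 'b']].drop_duplicates()

merged_df = df1.merge(keys, on=['a', 'b'], how='inner')
print(merged_df)
```

The result has only the columns a, b and c from df1, with no _x or _y suffixes to clean up.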
Suppose I have dataframe like this
>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
Now I want all records from each group (grouped by id) except the last 3. That means I want to drop the last 3 records from every group. How can I do it using pandas groupby? This is dummy data.
Use GroupBy.cumcount with ascending=False to count from the back of each group, then compare with Series.gt against 2 (not 3), because the counter starts from 0:
df = df[df.groupby('id').cumcount(ascending=False).gt(2)]
print (df)
id value
3 2 1
Details:
print (df.groupby('id').cumcount(ascending=False))
0 2
1 1
2 0
3 3
4 2
5 1
6 0
7 0
8 0
dtype: int64
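The same idea can be wrapped up for an arbitrary number of trailing rows. This is a minimal sketch; drop_last_n is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

def drop_last_n(frame, key, n):
    """Return frame without the last n rows of each group of `key`."""
    # 0 for the last row of a group, 1 for the one before it, and so on.
    from_back = frame.groupby(key).cumcount(ascending=False)
    # Rows counted n or more from the back are not among the last n.
    return frame[from_back >= n]

kept = drop_last_n(df, 'id', 3)
print(kept)
```

Groups with 3 or fewer rows disappear entirely, which is why only one row of id 2 survives here.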
My dataframe has three columns: value, ID and distance. In the ID column, whenever the value changes from 2 to any other value, I want to count the rows of the run that follows, record the first and last value of that run, and also save the corresponding value of the distance column at the point where ID changes from 2 to the other value.
df=pd.DataFrame({'value':[3,4,7,8,11,20,15,20,15,16],'ID':[2,2,8,8,8,2,2,2,5,5],'distance':[0,0,1,0,0,0,0,0,0,0]})
print(df)
value ID distance
0 3 2 0
1 4 2 0
2 7 8 1
3 8 8 0
4 11 8 0
5 20 2 0
6 15 2 0
7 20 2 0
8 15 5 0
9 16 5 0
required results:
df_out=pd.DataFrame({'rows_Count':[3,2],'value_first':[7,15],'value_last':[11,16],'distance_first':[1,0]})
print(df_out)
rows_Count value_first value_last distance_first
0 3 7 11 1
1 2 15 16 0
Use:
#compare by 2
m = df['ID'].eq(2)
#filter out rows before the first 2 (none in the sample data, but possible in real data)
df = df[m.cumsum().ne(0)]
#create unique groups for non 2 groups, add missing values by reindex
s = m.ne(m.shift()).cumsum()[~m].reindex(df.index)
#aggregate with helper s Series
df1 = df.groupby(s).agg({'ID':'size', 'value':['first','last'], 'distance':'first'})
#flatten MultiIndex
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index(drop=True)
print (df1)
ID_size value_first value_last distance_first
0 3 7 11 1
1 2 15 16 0
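If the column names should match the required output exactly (rows_Count instead of ID_size), pandas named aggregation is one way to get there without flattening a MultiIndex. The following sketch reuses the same grouping logic on the sample data. Renaming the helper Series to run_id is an addition here, made only so the group key does not share a name with the ID column.

```python
import pandas as pd

df = pd.DataFrame({'value': [3, 4, 7, 8, 11, 20, 15, 20, 15, 16],
                   'ID': [2, 2, 8, 8, 8, 2, 2, 2, 5, 5],
                   'distance': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]})

# True where ID is 2.
m = df['ID'].eq(2)
# Drop rows before the first 2.
df = df[m.cumsum().ne(0)]
# Label each consecutive run, keep labels only for the non-2 runs;
# rows where ID is 2 get NaN and are ignored by groupby.
s = m.ne(m.shift()).cumsum()[~m].reindex(df.index).rename('run_id')

# Named aggregation: output column = (input column, aggregation).
df_out = (df.groupby(s)
            .agg(rows_Count=('ID', 'size'),
                 value_first=('value', 'first'),
                 value_last=('value', 'last'),
                 distance_first=('distance', 'first'))
            .reset_index(drop=True))
print(df_out)
```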
Verify with changed data (the first group is no longer a group of 2s):
df=pd.DataFrame({'value':[3,4,7,8,11,20,15,20,15,16],
'ID':[1,7,8,8,8,2,2,2,5,5],
'distance':[0,0,1,0,0,0,0,0,0,0]})
print(df)
value ID distance
0 3 1 0 <- changed ID
1 4 7 0 <- changed ID
2 7 8 1
3 8 8 0
4 11 8 0
5 20 2 0
6 15 2 0
7 20 2 0
8 15 5 0
9 16 5 0
#compare by 2
m = df['ID'].eq(2)
#filter out rows before the first 2 (none in the sample data, but possible in real data)
df = df[m.cumsum().ne(0)]
#create unique groups for non 2 groups, add missing values by reindex
s = m.ne(m.shift()).cumsum()[~m].reindex(df.index)
#aggregate with helper s Series
df1 = df.groupby(s).agg({'ID':'size', 'value':['first','last'], 'distance':'first'})
#flatten MultiIndex
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index(drop=True)
print (df1)
ID_size value_first value_last distance_first
0 2 15 16 0
Okay, this is tricky. I have a pandas dataframe of machine log data. The data has an index, but the dataframe contains various jobs. I want to give each individual job an index of its own, so that I can compare the jobs with each other. So I want another column with an index that begins at zero, runs to the end of the job, and then resets to zero for the new job. Or do I have to do this line by line?
I think you need set_index with cumcount to count the rows within each category:
df = df.set_index(df.groupby('Job Columns').cumcount(), append=True)
Sample:
import numpy as np
import pandas as pd
np.random.seed(456)
df = pd.DataFrame({'Jobs':np.random.choice(['a','b','c'], size=10)})
#solution with sorting
df1 = df.sort_values('Jobs').reset_index(drop=True)
df1 = df1.set_index(df1.groupby('Jobs').cumcount(), append=True)
print (df1)
Jobs
0 0 a
1 1 a
2 2 a
3 0 b
4 1 b
5 2 b
6 3 b
7 0 c
8 1 c
9 2 c
#solution with no sorting
df2 = df.set_index(df.groupby('Jobs').cumcount(), append=True)
print (df2)
Jobs
0 0 b
1 1 b
2 0 c
3 0 a
4 1 c
5 2 c
6 1 a
7 2 b
8 2 a
9 3 b
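Since the question asks for another column rather than an extra index level, the cumcount result can also simply be assigned to a new column. A minimal sketch with fixed data (the same job order as in the unsorted output above); the column name job_index is an assumption made for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Jobs': ['b', 'b', 'c', 'a', 'c', 'c', 'a', 'b', 'a', 'b']})

# Counter that starts at 0 for the first row of each job and
# increases by 1 for every further row of the same job.
df['job_index'] = df.groupby('Jobs').cumcount()
print(df)
```

No sorting is needed, and no explicit line-by-line loop either: cumcount numbers the rows of each job in their existing order.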