Dataset with maximal rows by userId indicated - python-3.x

I have a dataframe like this:
ID  date        var1  var2  var3
AB  22/03/2020     0     1     3
AB  29/03/2020     0     3     3
CD  22/03/2020     0     1     1
I would like a new dataset that keeps a value wherever it is the row maximum (ties can happen too) and replaces it with -1 otherwise. So it would be:
ID  date        var1  var2  var3
AB  22/03/2020    -1    -1     3
AB  29/03/2020    -1     3     3
CD  22/03/2020    -1     1     1
But I am not managing to do this at all. What can I try next?

Select only the numeric columns with DataFrame.select_dtypes:
df1 = df.select_dtypes(np.number)
Or select all columns except the first two by position with DataFrame.iloc:
df1 = df.iloc[:, 2:]
Or select the columns whose labels contain var with DataFrame.filter:
df1 = df.filter(like='var')
Then set the new values with DataFrame.where against the row-wise max:
df[df1.columns] = df1.where(df1.eq(df1.max(axis=1), axis=0), -1)
print (df)
   ID        date  var1  var2  var3
0  AB  22/03/2020    -1    -1     3
1  AB  29/03/2020    -1     3     3
2  CD  22/03/2020    -1     1     1
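For reference, a minimal end-to-end sketch of this approach, assuming the sample frame is rebuilt from the question's data; ties are handled automatically, because eq marks every cell that equals the row maximum:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['AB', 'AB', 'CD'],
                   'date': ['22/03/2020', '29/03/2020', '22/03/2020'],
                   'var1': [0, 0, 0],
                   'var2': [1, 3, 1],
                   'var3': [3, 3, 1]})

# select the numeric columns, then replace every cell that is not the row maximum with -1
df1 = df.select_dtypes(np.number)
df[df1.columns] = df1.where(df1.eq(df1.max(axis=1), axis=0), -1)
print (df)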

IIUC, use where and update back:
s = df.loc[:, 'var1':]
df.update(s.where(s.eq(s.max(1), axis=0), -1))
df
   ID        date  var1  var2  var3
0  AB  22/03/2020    -1    -1     3
1  AB  29/03/2020    -1     3     3
2  CD  22/03/2020    -1     1     1
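Note that DataFrame.update modifies df in place, taking only the non-NA values from the frame passed to it, so there is no need to assign the result back.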

Related

How to group a dataframe by multiple columns, sum and sort the totals in descending order?

Given the following dataframe:
user_id  col1      col2
1        A            4
1        A           22
1        A          112
1        B     -0.22222
1        B            9
1        C            0
2        A           -1
2        A           -5
2        K           NA
I want to group by user_id and col1 and count, then sort the counts within each group in descending order.
Here is what I'm trying to do, but I don't get the right output:
df[["user_id", "col1"]]. \
    groupby(["user_id", "col1"]). \
    agg(counts=("col1", "count")). \
    reset_index(). \
    sort_values(["user_id", "col1", "counts"], ascending=False)
Please advise what should I change to make it work.
Expected output:
user_id  col1  counts
1        A          3
         B          2
         C          1
2        A          2
         K          1
Use GroupBy.size:
In [199]: df.groupby(['user_id', 'col1']).size()
Out[199]:
user_id  col1
1        A       3
         B       2
         C       1
2        A       2
         K       1
OR:
In [201]: df.groupby(['user_id', 'col1']).size().reset_index(name='counts')
Out[201]:
   user_id col1  counts
0        1    A       3
1        1    B       2
2        1    C       1
3        2    A       2
4        2    K       1
EDIT:
In [206]: df.groupby(['user_id', 'col1']).agg({'col2': 'size'})
Out[206]:
              col2
user_id col1
1       A        3
        B        2
        C        1
2       A        2
        K        1
EDIT-2: For sorting, use:
In [213]: df.groupby(['user_id', 'col1'])['col2'].size().sort_values(ascending=False)
Out[213]:
user_id  col1
1        A       3
2        A       2
1        B       2
2        K       1
1        C       1
Name: col2, dtype: int64
Using the main idea from Mayank's answer:
df.groupby(["user_id", "col1"]).size().reset_index(name="counts").sort_values(["user_id", "counts"], ascending=[True, False])
solved my issue.
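Putting it together, a minimal runnable sketch that reproduces the expected output, assuming the NA entry in the sample is read in as NaN:
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 1, 1, 1, 2, 2, 2],
                   'col1': ['A', 'A', 'A', 'B', 'B', 'C', 'A', 'A', 'K'],
                   'col2': [4, 22, 112, -0.22222, 9, 0, -1, -5, np.nan]})

# count rows per (user_id, col1), then sort the counts descending within each user
counts = (df.groupby(['user_id', 'col1'])
            .size()
            .reset_index(name='counts')
            .sort_values(['user_id', 'counts'], ascending=[True, False]))
print (counts)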

Calculation using shifting is not working in a for loop

The problem consists of calculating the column "accumulated" from the columns "accumulated" and "weekly". The formula is: accumulated[t] = weekly[t] + accumulated[t-1].
The desired result should be:
weekly  accumulated
2       0
1       1
4       5
2       7
The result I'm obtaining is:
weekly  accumulated
2       0
1       1
4       4
2       2
What I have tried is:
for key, value in df_dic.items():
    df_aux = df_dic[key]
    df_aux['accumulated'] = 0
    df_aux['accumulated'] = (df_aux.weekly + df_aux.accumulated.shift(1))
    #df_aux["accumulated"] = df_aux.iloc[:,2] + df_aux.iloc[:,3].shift(1)
    df_aux.iloc[0, 3] = 0  # I put this because I want to force the first cell to be 0.
Here df_aux.iloc[0,3] is the first row of the column "accumulated".
What am I doing wrong?
Thank you
EDIT: df_dic is a dictionary with 5 dataframes, of the form {0: df1, 1: df2, 2: df3, ...}. All the dataframes have the same size and the same column names, so I use the for loop to run the same calculation on every dataframe in the dictionary.
EDIT 2: I tried doing the computation outside the for loop and it still does not work. What I am doing is:
df_auxp = df_dic[0]
df_auxp['accumulated'] = 0
df_auxp['accumulated'] = df_auxp["weekly"] + df_auxp["accumulated"].shift(1)
df_auxp.iloc[0,3] = df_auxp.iloc[0,3].fillna(0)
Maybe it has something to do with the dictionary interaction...
To solve for 3 dataframes
import pandas as pd
df1 = pd.DataFrame({'weekly':[2,1,4,2]})
df2 = pd.DataFrame({'weekly':[3,2,5,3]})
df3 = pd.DataFrame({'weekly':[4,3,6,4]})
print (df1)
print (df2)
print (df3)
for d in [df1, df2, df3]:
    d['accumulated'] = d['weekly'].cumsum() - d.iloc[0, 0]
    print (d)
The output of this will be as follows:
Original dataframes:
df1
weekly
0 2
1 1
2 4
3 2
df2
weekly
0 3
1 2
2 5
3 3
df3
weekly
0 4
1 3
2 6
3 4
Updated dataframes:
df1:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
df2:
weekly accumulated
0 3 0
1 2 2
2 5 7
3 3 10
df3:
weekly accumulated
0 4 0
1 3 3
2 6 9
3 4 13
To solve for 1 dataframe
You need to use cumsum and then subtract the value from the first row. That will give you the desired result. Here's how to do it:
import pandas as pd
df = pd.DataFrame({'weekly':[2,1,4,2]})
print (df)
df['accumulated'] = df['weekly'].cumsum() - df.iloc[0,0]
print (df)
Original dataframe:
weekly
0 2
1 1
2 4
3 2
Updated dataframe:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
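Since the question iterates over a dictionary of dataframes, the same cumsum fix drops straight into the original loop. A minimal sketch, assuming df_dic maps keys to dataframes that each have a 'weekly' column (the dictionary below is a hypothetical stand-in):
import pandas as pd

# hypothetical stand-in for the question's df_dic
df_dic = {0: pd.DataFrame({'weekly': [2, 1, 4, 2]}),
          1: pd.DataFrame({'weekly': [3, 2, 5, 3]})}

for key, df_aux in df_dic.items():
    # running total of 'weekly', shifted so the first row starts at 0
    df_aux['accumulated'] = df_aux['weekly'].cumsum() - df_aux['weekly'].iloc[0]
    print (key)
    print (df_aux)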

groupby and trim some rows based on condition

I have a data frame something like this:
df = pd.DataFrame({"ID":[1,1,2,2,2,3,3,3,3,3],
"IF_car":[1,0,0,1,0,0,0,1,0,1],
"IF_car_history":[0,0,0,1,0,0,0,1,0,1],
"observation":[0,0,0,1,0,0,0,2,0,3]})
I want to trim the rows within each ID group based on the condition "IF_car_history" == 1. What I tried:
tried_df = df.groupby(['ID']).apply(lambda x: x.loc[:(x['IF_car_history'] == 1).idxmax(), :]).reset_index(drop=True)
I want to drop the rows in each group that come after the last ['IF_car_history'] == 1.
Expected output:
Thanks
First compare the values with Series.eq to build the mask m. Because the rows after the last 1 in each group have to be removed, the mask is reversed with [::-1] so that GroupBy.cumsum counts from the end of each group; rows after a group's last 1 then have a cumulative sum of 0. Compare with ne(0), reverse back, and filter by boolean indexing:
m = df['IF_car_history'].eq(1).iloc[::-1]
df1 = df[m.groupby(df['ID']).cumsum().ne(0).iloc[::-1]]
print (df1)
   ID  IF_car  IF_car_history  observation
2   2       0               0            0
3   2       1               1            1
5   3       0               0            0
6   3       0               0            0
7   3       1               1            2
8   3       0               0            0
9   3       1               1            3
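To see why the double reversal works, a short sketch printing the intermediate values, assuming the df built in the question:
m = df['IF_car_history'].eq(1).iloc[::-1]

# within each ID, the reversed cumulative sum stays 0 only for rows
# that come after the group's last 1 (and for groups with no 1 at all)
print (m.groupby(df['ID']).cumsum().iloc[::-1])

keep = m.groupby(df['ID']).cumsum().ne(0).iloc[::-1]
print (df[keep])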

How to replace the values of 1's and 0's of various columns into a single column of a data frame?

The 0's and 1's need to be transposed to their appropriate headers in python.
How can I achieve this and get the column final_list?
If there is always only one 1 per row, use DataFrame.dot:
df = pd.DataFrame({'a': [0, 1, 0],
                   'b': [1, 0, 0],
                   'c': [0, 0, 1]})
df['Final'] = df.dot(df.columns)
print (df)
   a  b  c Final
0  0  1  0     b
1  1  0  0     a
2  0  0  1     c
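As a side note on why DataFrame.dot works here: multiplying an integer by a string in Python repeats the string, so 1 * 'b' is 'b' and 0 * 'b' is an empty string, and the row-wise dot product therefore concatenates exactly the labels of the columns holding a 1.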
If multiple 1s per row are possible, also add a separator and then strip the trailing one from the output Series with Series.str.rstrip:
df = pd.DataFrame({'a': [0, 1, 0],
                   'b': [1, 1, 0],
                   'c': [1, 1, 1]})
df['Final'] = df.dot(df.columns + ',').str.rstrip(',')
print (df)
   a  b  c  Final
0  0  1  1    b,c
1  1  1  1  a,b,c
2  0  0  1      c

How to filter values from a multi index Pandas data frame

I have a multi index pandas data frame df as below:
                  Count
Letter Direction
A      -1             3
        1             0
B      -1             2
        1             4
C      -1             4
        1            10
D      -1             8
        1             1
E      -1             4
        1             5
F      -1             1
        1             1
I want to keep the Letters that have Count < 2 in either one (or both) of the Directions.
I tried df[df.Count < 2], but it gives the output below:
                  Count
Letter Direction
A       1             0
D       1             1
F      -1             1
        1             1
Desired output is as below:
                  Count
Letter Direction
A      -1             3
        1             0
D      -1             8
        1             1
F      -1             1
        1             1
What should I do to get the above?
Use GroupBy.transform with a boolean mask and GroupBy.any: any checks whether there is at least one True per first level of the MultiIndex, and transform returns a mask the same size as the original DataFrame, so it is possible to filter by boolean indexing:
df = df[(df.Count < 2).groupby(level=0).transform('any')]
print (df)
                  Count
Letter Direction
A      -1             3
        1             0
D      -1             8
        1             1
F      -1             1
        1             1
Another solution is to use MultiIndex.get_level_values to get the Letter values meeting the condition (deduplicated with unique, since F matches twice) and then select those groups with DataFrame.loc:
df = df.loc[df.index.get_level_values(0)[df.Count < 2].unique()]
print (df)
                  Count
Letter Direction
A      -1             3
        1             0
D      -1             8
        1             1
F      -1             1
        1             1
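For reproducibility, a minimal sketch that builds the sample frame and applies the transform('any') solution; the Count values are taken from the question:
import pandas as pd

idx = pd.MultiIndex.from_product([list('ABCDEF'), [-1, 1]],
                                 names=['Letter', 'Direction'])
df = pd.DataFrame({'Count': [3, 0, 2, 4, 4, 10, 8, 1, 4, 5, 1, 1]}, index=idx)

# keep every Letter where at least one Direction has Count < 2
print (df[(df.Count < 2).groupby(level=0).transform('any')])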
