Using logical comparison together with groupby in pandas

I have the following dataframe:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({'item': {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'B',
6: 'B',
7: 'B',
8: 'B',
9: 'B',
10: 'B',
11: 'B',
12: 'B',
13: 'C',
14: 'C',
15: 'C',
16: 'C',
17: 'D',
18: 'D'},
'Date': {0: Timestamp('2021-05-02 00:00:00'),
1: Timestamp('2021-05-02 00:00:00'),
2: Timestamp('2021-05-02 00:00:00'),
3: Timestamp('2021-05-03 00:00:00'),
4: Timestamp('2021-06-13 00:00:00'),
5: Timestamp('2021-05-03 00:00:00'),
6: Timestamp('2021-05-04 00:00:00'),
7: Timestamp('2021-05-05 00:00:00'),
8: Timestamp('2021-05-06 00:00:00'),
9: Timestamp('2021-05-07 00:00:00'),
10: Timestamp('2021-05-08 00:00:00'),
11: Timestamp('2021-05-09 00:00:00'),
12: Timestamp('2021-05-10 00:00:00'),
13: Timestamp('2021-06-14 00:00:00'),
14: Timestamp('2021-06-15 00:00:00'),
15: Timestamp('2021-06-16 00:00:00'),
16: Timestamp('2021-07-23 00:00:00'),
17: Timestamp('2021-07-07 00:00:00'),
18: Timestamp('2021-07-08 00:00:00')},
'price': {0: 249,
1: 249,
2: 253,
3: 260,
4: 260,
5: 13,
6: 13,
7: 13,
8: 13,
9: 17,
10: 17,
11: 17,
12: 17,
13: 123,
14: 123,
15: 123,
16: 123,
17: 12,
18: 12}})
which looks like this:
item Date price
0 A 2021-05-02 249
1 A 2021-05-02 249
2 A 2021-05-02 253
3 A 2021-05-03 260
4 A 2021-06-13 260
5 B 2021-05-03 13
6 B 2021-05-04 13
7 B 2021-05-05 13
8 B 2021-05-06 13
9 B 2021-05-07 17
10 B 2021-05-08 17
11 B 2021-05-09 17
12 B 2021-05-10 17
13 C 2021-06-14 123
14 C 2021-06-15 123
15 C 2021-06-16 123
16 C 2021-07-23 123
17 D 2021-07-07 12
18 D 2021-07-08 12
As you can see, the price of an item changes over time. What I want is a column that indicates, for each item, when its price changes. My first idea was to check whether the price in the previous row is the same as in the current row, within a group.
Now, I was convinced that I could do something like this:
df_changes['changed'] = df_changes.groupby(['item'])['price'].eq(df_changes['price'])
to compare row values within a group (returning a boolean) and then translate this to integers to get:
change_item_num diffsum Step
0 0 0 0
1 1 0 0
2 1 1 1
3 1 1 2
4 1 0 2
5 0 0 0
6 1 0 0
7 1 0 0
8 1 0 0
9 1 1 1
10 1 0 1
11 1 0 1
12 1 0 1
13 0 0 0
14 1 0 0
15 1 0 0
16 1 0 0
17 0 0 0
18 1 0 0
where the Step column marks the changes.
But I was wrong. Whatever I do, I get the error:
AttributeError: 'SeriesGroupBy' object has no attribute 'eq'
Instead, I found a workaround that I am very unhappy about (it uses a numeric item_num column derived from item):
j = df_changes.price
k = df_changes.item_num
df_changes['change_price'] = j.eq(j.shift()).astype(int)
df_changes['change_item_num'] = k.eq(k.shift()).astype(int)
df_changes['diffsum'] = abs(df_changes['change_price']-df_changes['change_item_num'])
df_changes['Step'] = df_changes.groupby('item')['diffsum'].cumsum()+1
which returns:
item Date price item_num change_price change_item_num diffsum \
0 A 2021-05-02 249 1 0 0 0
1 A 2021-05-02 249 1 1 1 0
2 A 2021-05-02 253 1 0 1 1
3 A 2021-05-03 260 1 0 1 1
4 A 2021-06-13 260 1 1 1 0
5 B 2021-05-03 13 2 0 0 0
6 B 2021-05-04 13 2 1 1 0
7 B 2021-05-05 13 2 1 1 0
8 B 2021-05-06 13 2 1 1 0
9 B 2021-05-07 17 2 0 1 1
10 B 2021-05-08 17 2 1 1 0
11 B 2021-05-09 17 2 1 1 0
12 B 2021-05-10 17 2 1 1 0
13 C 2021-06-14 123 3 0 0 0
14 C 2021-06-15 123 3 1 1 0
15 C 2021-06-16 123 3 1 1 0
16 C 2021-07-23 123 3 1 1 0
17 D 2021-07-07 12 4 0 0 0
18 D 2021-07-08 12 4 1 1 0
Step
0 1
1 1
2 2
3 3
4 3
5 1
6 1
7 1
8 1
9 2
10 2
11 2
12 2
13 1
14 1
15 1
16 1
17 1
18 1
Surely, there must be an easier way. If not, can anyone explain WHY I cannot use eq or any other logical comparison within a groupby?
Thankful for any new knowledge!

Compare the current row with the previous row in the price column to find the locations where the price changes, then group the resulting mask by the item column and take the cumulative sum to assign a sequence number to each price change per item:
m = df['price'] != df['price'].shift()
df['step'] = m.groupby(df['item']).cumsum()
print(df)
item Date price step
0 A 2021-05-02 249 1
1 A 2021-05-02 249 1
2 A 2021-05-02 253 2
3 A 2021-05-03 260 3
4 A 2021-06-13 260 3
5 B 2021-05-03 13 1
6 B 2021-05-04 13 1
7 B 2021-05-05 13 1
8 B 2021-05-06 13 1
9 B 2021-05-07 17 2
10 B 2021-05-08 17 2
11 B 2021-05-09 17 2
12 B 2021-05-10 17 2
13 C 2021-06-14 123 1
14 C 2021-06-15 123 1
15 C 2021-06-16 123 1
16 C 2021-07-23 123 1
17 D 2021-07-07 12 1
18 D 2021-07-08 12 1
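As for the WHY: a SeriesGroupBy object only exposes group-wise operations (aggregations like sum, transformations like cumsum or shift); element-wise comparison methods such as eq live on Series itself, which is why df.groupby(['item'])['price'].eq(...) raises the AttributeError. If the comparison should never cross group boundaries, a variant sketch is to shift within each group first:
# Shift per group: the first row of each item compares against NaN,
# so it is always flagged as the start of a new step.
m = df['price'] != df.groupby('item')['price'].shift()
df['step'] = m.groupby(df['item']).cumsum()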

Related

Set upperbound in a column for a specific group by using Python

I have a dataset given as such in Python:
#Load the required libraries
import pandas as pd
#Create dataset
data = {'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
        'Salary': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8],
        'Children': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No'],
        'Days': [123, 128, 66, 120, 141, 123, 128, 66, 120, 141, 52, 96, 120, 141, 52, 96, 120, 141, 123, 15, 85, 36, 58, 89],
        }
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
Now, for every ID/group, I wish to set an upper bound on the 'Salary' values.
For example,
For ID=1, the upperbound of 'Salary' should be set at 4
For ID=2, the upperbound of 'Salary' should be set at 3
For ID=3, the upperbound of 'Salary' should be set at 5
Can somebody please let me know how to achieve this task in python?
Use a custom function with a mapping from a helper dictionary in GroupBy.transform:
d = {1: 4, 2: 3, 3: 5}

def f(x):
    # x.name is the group key (ID): raise the first d[ID] rows to d[ID]
    x.iloc[:d[x.name]] = d[x.name]
    return x
df['Salary'] = df.groupby('ID')['Salary'].transform(f)
print (df)
ID Salary Children Days
0 1 4 No 123
1 1 4 Yes 128
2 1 4 Yes 66
3 1 4 Yes 120
4 1 5 No 141
5 1 6 No 123
6 1 7 Yes 128
7 1 8 Yes 66
8 1 9 Yes 120
9 1 10 No 141
10 2 3 Yes 52
11 2 3 Yes 96
12 2 3 No 120
13 2 4 Yes 141
14 2 5 Yes 52
15 2 6 Yes 96
16 3 5 Yes 120
17 3 5 Yes 141
18 3 5 No 123
19 3 5 Yes 15
20 3 5 No 85
21 3 6 Yes 36
22 3 7 Yes 58
23 3 8 No 89
Another idea is to use GroupBy.cumcount as a counter per ID, compare it against the mapped bound, and where it matches set the mapped Series via Series.mask:
d = {1:4, 2:3, 3:5}
s = df['ID'].map(d)
df['Salary'] = df['Salary'].mask(df.groupby('ID').cumcount().lt(s), s)
Or, since the Salary column itself acts as a counter here, it is possible to use:
s = df['ID'].map(d)
df['Salary'] = df['Salary'].mask(df['Salary'].le(s), s)
print (df)
ID Salary Children Days
0 1 4 No 123
1 1 4 Yes 128
2 1 4 Yes 66
3 1 4 Yes 120
4 1 5 No 141
5 1 6 No 123
6 1 7 Yes 128
7 1 8 Yes 66
8 1 9 Yes 120
9 1 10 No 141
10 2 3 Yes 52
11 2 3 Yes 96
12 2 3 No 120
13 2 4 Yes 141
14 2 5 Yes 52
15 2 6 Yes 96
16 3 5 Yes 120
17 3 5 Yes 141
18 3 5 No 123
19 3 5 Yes 15
20 3 5 No 85
21 3 6 Yes 36
22 3 7 Yes 58
23 3 8 No 89
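For reference, Series.mask(cond, other) replaces the values where cond is True and keeps the rest, e.g.:
s = pd.Series([1, 2, 3])
print(s.mask(s < 3, 0))   # 0, 0, 3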
One option is to create a series from the dictionary, merge with the dataframe and then update the Salary column conditionally:
import numpy as np

ser = pd.Series(d, name='d')   # d = {1: 4, 2: 3, 3: 5} as above
ser.index.name = 'ID'
(df
 .merge(ser, on='ID')
 .assign(Salary=lambda f: np.where(f.Salary.lt(f.d), f.d, f.Salary))
 .drop(columns='d')
)
ID Salary Children Days
0 1 4 No 123
1 1 4 Yes 128
2 1 4 Yes 66
3 1 4 Yes 120
4 1 5 No 141
5 1 6 No 123
6 1 7 Yes 128
7 1 8 Yes 66
8 1 9 Yes 120
9 1 10 No 141
10 2 3 Yes 52
11 2 3 Yes 96
12 2 3 No 120
13 2 4 Yes 141
14 2 5 Yes 52
15 2 6 Yes 96
16 3 5 Yes 120
17 3 5 Yes 141
18 3 5 No 123
19 3 5 Yes 15
20 3 5 No 85
21 3 6 Yes 36
22 3 7 Yes 58
23 3 8 No 89
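Since every variant above effectively raises values below a per-ID threshold up to that threshold, the same result can arguably be had in one line with Series.clip (a sketch, assuming the mapping d from above):
# Floor each Salary at its ID's threshold; values already above it are kept.
df['Salary'] = df['Salary'].clip(lower=df['ID'].map(d))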

How to return first item when the items in the pandas dataframe window are the same?

I am a python beginner.
I have the following pandas DataFrame, with only two columns: "Time" and "Input".
I want to loop over the "Input" column with a window size of w = 3 (three consecutive values): for every selected window, if all the items within that window are 1s, keep the first item as 1 and change the remaining values to 0s.
index Time Input
0 11 0
1 22 0
2 33 0
3 44 1
4 55 1
5 66 1
6 77 0
7 88 0
8 99 0
9 1010 0
10 1111 1
11 1212 1
12 1313 1
13 1414 0
14 1515 0
My intended output is as follows:
index Time Input What_I_got What_I_Want
0 11 0 0 0
1 22 0 0 0
2 33 0 0 0
3 44 1 1 1
4 55 1 1 0
5 66 1 1 0
6 77 1 1 1
7 88 1 0 0
8 99 1 0 0
9 1010 0 0 0
10 1111 1 1 1
11 1212 1 0 0
12 1313 1 0 0
13 1414 0 0 0
14 1515 0 0 0
What should I do to get the desired output? Am I missing something in my code?
import pandas as pd
import re

# Encode the column as a '0'/'1' string, replace each non-overlapping '111'
# with '100', then turn it back into an integer Series.
pd.Series(list(re.sub('111', '100', ''.join(df.Input.astype(str))))).astype(int)
Out[23]:
0 0
1 0
2 0
3 1
4 0
5 0
6 1
7 0
8 0
9 0
10 1
11 0
12 0
13 0
14 0
dtype: int32
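A regex-free alternative sketch: label each run of consecutive equal values, then zero out every 1 that sits inside a complete window of three but does not start one (assumes the df defined above):
g = df['Input'].ne(df['Input'].shift()).cumsum()    # label runs of equal values
pos = df.groupby(g).cumcount()                      # position inside the run
runlen = df.groupby(g)['Input'].transform('size')   # length of the run
complete = (pos // 3) * 3 + 3 <= runlen             # does the window holding pos fit?
df['Output'] = df['Input'].mask(df['Input'].eq(1) & complete & (pos % 3 != 0), 0)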

Pandas: Creating column by adding 1 to the repeated elements, minus the 1st element that it finds when it is not consecutive, Python

I'm developing the following code in Python using Pandas:
import pandas as pd
data = {"Value": [4, 4, 2, 1, 1, 1, 0, 7, 0, 4, 1, 1, 3, 0, 3, 0, 7, 0, 4, 1, 0, 1, 0, 1, 4, 4, 2, 3],
"IdPar": [0, 0, 0, 0, 0, 0, 10, 10, 10, 10, 10, 0, 0, 22, 22, 28, 28, 28, 28, 0, 0, 38, 38 , 0, 0, 0, 0, 0]
}
df = pd.DataFrame(data)
df['Count'] = df.groupby('IdPar')['IdPar'].cumcount() + 1
df.loc[df['IdPar'] == 0, 'Count'] = 0
df['Substract'] = df.index - df['Count']  # shows the subtraction, but it should not include the first repeated element
I want to achieve the following output, namely the Final column: it adds 1 to each repeated element of the Substract column after the first occurrence, as long as the previous value in the row is not consecutive:
Value IdPar Count Substract Final
0 4 0 0 0 0
1 4 0 0 1 1
2 2 0 0 2 2
3 1 0 0 3 3
4 1 0 0 4 4
5 1 0 0 5 5
6 0 10 1 5 6
7 7 10 2 5 6
8 0 10 3 5 6
9 4 10 4 5 6
10 1 10 5 5 6
11 1 0 0 11 11
12 3 0 0 12 12
13 0 22 1 12 13
14 3 22 2 12 13
15 0 28 1 14 14
16 7 28 2 14 14
17 0 28 3 14 14
18 4 28 4 14 14
19 1 0 0 19 19
20 0 0 0 20 20
21 1 38 1 20 21
22 0 38 2 20 21
23 1 0 0 23 23
24 4 0 0 24 24
25 4 0 0 25 25
26 2 0 0 26 26
27 3 0 0 27 27
I have already tried various Pandas functions like df['Final'] = df['Substract'].loc[lambda x: x > df['Substract'].duplicated()] or apply(lambda), but I get an error. I know it can be done with Pandas functions, but I can't find how to achieve it. If anyone can help me, I'll be very grateful. Regards.
Use shift:
df['Substract'] + (df['Substract'].shift() == df['Substract'])
Or (thanks to @SeaBean):
df['Substract'] + (df['Substract'].diff() == 0)
Output
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 6
8 6
9 6
10 6
11 11
12 12
13 13
14 13
15 14
16 15
17 15
18 15
19 19
20 20
21 21
22 21
23 23
24 24
25 25
26 26
27 27
Name: Substract, dtype: int64
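This works because comparing a Series with its shifted self yields a boolean Series, and booleans count as 0/1 in arithmetic, so every consecutive duplicate gets 1 added. A minimal demonstration:
s = pd.Series([5, 5, 7])
print(s.shift() == s)        # False, True, False
print(s + (s.shift() == s))  # 5, 6, 7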

Pandas how to get top n group by flag column

I have a dataframe like the one below.
df = pd.DataFrame({'group':[1,2,1,3,3,1,4,4,1,4], 'match': [1,1,1,1,1,1,1,1,1,1]})
group match
0 1 1
1 2 1
2 1 1
3 3 1
4 3 1
5 1 1
6 4 1
7 4 1
8 1 1
9 4 1
I want to get the top n groups, like below (n=3).
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
In my actual data each row carries other information as well, so I only want to rank the groups by their number of matches and extract the top n.
How can I do this?
I believe that if you need the top 3 groups per match value, you should use SeriesGroupBy.value_counts with GroupBy.head for the top 3 per group, then convert the index to a DataFrame with Index.to_frame and join back with DataFrame.merge:
s = df.groupby('match')['group'].value_counts().groupby(level=0).head(3).swaplevel()
df = s.index.to_frame().reset_index(drop=True).merge(df)
print (df)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
Or, if you only need the values where match is 1, use Series.value_counts after filtering with boolean indexing:
s = df.loc[df['match'] == 1, 'group'].value_counts().head(3)
df = s.index.to_frame(name='group').merge(df)
print (df)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
A solution with isin and ordered categoricals:
# if you need to filter on match == 1
idx = df.loc[df['match'] == 1, 'group'].value_counts().head(3).index
# if you don't need the filter:
# idx = df.group.value_counts().head(3).index
df = df[df.group.isin(idx)]
df['group'] = pd.CategoricalIndex(df['group'], ordered=True, categories=idx)
df = df.sort_values('group')
print (df)
group match
0 1 1
2 1 1
5 1 1
8 1 1
6 4 1
7 4 1
9 4 1
3 3 1
4 3 1
The difference between the solutions is best seen with modified data in the match column:
df = pd.DataFrame({'group': [1, 2, 1, 3, 3, 1, 4, 4, 1, 4, 10, 20, 10, 20, 10, 30, 40],
                   'match': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]})
print (df)
group match
0 1 1
1 2 1
2 1 1
3 3 1
4 3 1
5 1 1
6 4 1
7 4 1
8 1 1
9 4 1
10 10 0
11 20 0
12 10 0
13 20 0
14 10 0
15 30 0
16 40 0
Top 3 groups per match value:
s = df.groupby('match')['group'].value_counts().groupby(level=0).head(3).swaplevel()
df1 = s.index.to_frame().reset_index(drop=True).merge(df)
print (df1)
group match
0 10 0
1 10 0
2 10 0
3 20 0
4 20 0
5 30 0
6 1 1
7 1 1
8 1 1
9 1 1
10 4 1
11 4 1
12 4 1
13 3 1
14 3 1
Top 3 groups where match == 1:
s = df.loc[df['match'] == 1, 'group'].value_counts().head(3)
df2 = s.index.to_frame(name='group').merge(df)
print (df2)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
Top 3 groups, ignoring the match column:
s = df['group'].value_counts().head(3)
df3 = s.index.to_frame(name='group').merge(df)
print (df3)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 10 0
5 10 0
6 10 0
7 4 1
8 4 1
9 4 1
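Since the actual data carries extra columns per row, here is a sketch (under that assumption) that keeps every original column while ordering the top 3 groups by frequency:
counts = df.loc[df['match'] == 1, 'group'].value_counts()
top = counts.head(3).index
out = (df[df['group'].isin(top)]
         .assign(freq=lambda d: d['group'].map(counts))          # group frequency, for ordering
         .sort_values('freq', ascending=False, kind='mergesort')  # stable: keeps row order per group
         .drop(columns='freq')
         .reset_index(drop=True))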

Sorting dataframe and creating new columns based on the rank of element

I have the following dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
        'name': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'],
        'Value': [1, 2, 3, 4, 5, 6, 0, 2, 4, 6, 3, 5]
    },
    columns=['name', 'id', 'Value'])
I can sort the data using id and value as shown below:
df.sort_values(['id','Value'],ascending = [True,False])
The printed table appears as follows:
name id Value
D 1 4
C 1 3
B 1 2
A 1 1
B 2 6
A 2 5
D 2 2
C 2 0
B 3 6
D 3 5
A 3 4
C 3 3
I would like to create 4 new columns (Rank1, Rank2, Rank3, Rank4): if the element in the name column has the highest Value, Rank1 is set to 1, else 0; if it has the second highest Value, Rank2 is set to 1, else 0.
The same goes for Rank3 and Rank4.
How could I do that?
Thanks.
Use:
df = df.join(pd.get_dummies(df.groupby('id').cumcount().add(1)).add_prefix('Rank'))
print (df)
name id Value Rank1 Rank2 Rank3 Rank4
3 D 1 4 1 0 0 0
2 C 1 3 0 1 0 0
1 B 1 2 0 0 1 0
0 A 1 1 0 0 0 1
5 B 2 6 1 0 0 0
4 A 2 5 0 1 0 0
7 D 2 2 0 0 1 0
6 C 2 0 0 0 0 1
9 B 3 6 1 0 0 0
11 D 3 5 0 1 0 0
8 A 3 4 0 0 1 0
10 C 3 3 0 0 0 1
Details:
For the count per group use GroupBy.cumcount, then add 1:
print (df.groupby('id').cumcount().add(1))
3 1
2 2
1 3
0 4
5 1
4 2
7 3
6 4
9 1
11 2
8 3
10 4
dtype: int64
For the indicator columns use get_dummies with add_prefix:
print (pd.get_dummies(df.groupby('id').cumcount().add(1)).add_prefix('Rank'))
Rank1 Rank2 Rank3 Rank4
3 1 0 0 0
2 0 1 0 0
1 0 0 1 0
0 0 0 0 1
5 1 0 0 0
4 0 1 0 0
7 0 0 1 0
6 0 0 0 1
9 1 0 0 0
11 0 1 0 0
8 0 0 1 0
10 0 0 0 1
This does not require a prior sort:
import numpy as np

df.join(
    pd.get_dummies(
        df.groupby('id').Value.apply(np.argsort).rsub(4)
    ).add_prefix('Rank')
)
name id Value Rank1 Rank2 Rank3 Rank4
0 D 1 4 1 0 0 0
1 C 1 3 0 1 0 0
2 B 1 2 0 0 1 0
3 A 1 1 0 0 0 1
4 B 2 6 1 0 0 0
5 A 2 5 0 1 0 0
6 D 2 2 0 0 1 0
7 C 2 0 0 0 0 1
8 B 3 6 1 0 0 0
9 D 3 5 0 1 0 0
10 A 3 4 0 0 1 0
11 C 3 3 0 0 0 1
A more dynamic version that does not hard-code the group size of 4:
df.join(
    pd.get_dummies(
        df.groupby('id').Value.apply(lambda x: len(x) - np.argsort(x))
    ).add_prefix('Rank')
)
name id Value Rank1 Rank2 Rank3 Rank4
0 D 1 4 1 0 0 0
1 C 1 3 0 1 0 0
2 B 1 2 0 0 1 0
3 A 1 1 0 0 0 1
4 B 2 6 1 0 0 0
5 A 2 5 0 1 0 0
6 D 2 2 0 0 1 0
7 C 2 0 0 0 0 1
8 B 3 6 1 0 0 0
9 D 3 5 0 1 0 0
10 A 3 4 0 0 1 0
11 C 3 3 0 0 0 1
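An equivalent sketch using Series.rank, which makes the "highest value gets Rank1" intent explicit (assumes the df defined in the question):
ranks = (df.groupby('id')['Value']
           .rank(method='first', ascending=False)  # 1 = highest Value per id
           .astype(int))
df = df.join(pd.get_dummies(ranks).add_prefix('Rank'))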
