Python: groupby multiple columns and generate count column

Input dataframe
row name Min
11 AA 0.3
11 AA 0.2
11 BB 0.3
11 CC 0.2
12 AS 0.3
12 BE 0.3
12 BE 0.4
I need to generate a new column 'count' that holds the number of times each (row, name) combination occurs.
Expected Output
row name Min Count
11 AA 0.3 2
11 AA 0.2 2
11 BB 0.3 1
11 CC 0.2 1
12 AS 0.3 1
12 BE 0.3 2
12 BE 0.4 2

Use groupby with transform to generate the count column:
df['count'] = df.groupby(['row', 'name'])["Min"].transform("count")
row name Min count
0 11 AA 0.3 2
1 11 AA 0.2 2
2 11 BB 0.3 1
3 11 CC 0.2 1
4 12 AS 0.3 1
5 12 BE 0.3 2
6 12 BE 0.4 2
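
For reproducibility, a minimal sketch that rebuilds the frame from the question (transform('count') counts non-null Min values per group, which equals the group size here; transform('size') would give the same result):
import pandas as pd

df = pd.DataFrame({'row':  [11, 11, 11, 11, 12, 12, 12],
                   'name': ['AA', 'AA', 'BB', 'CC', 'AS', 'BE', 'BE'],
                   'Min':  [0.3, 0.2, 0.3, 0.2, 0.3, 0.3, 0.4]})

# broadcast each (row, name) group's size back onto every row of that group
df['count'] = df.groupby(['row', 'name'])['Min'].transform('count')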

Alternatively, merge in the counts:
df.merge(df.groupby(['row', 'name']).size().reset_index().rename(columns={0:'Count'}), on=['row','name'])
row name Min Count
0 11 AA 0.3 2
1 11 AA 0.2 2
2 11 BB 0.3 1
3 11 CC 0.2 1
4 12 AS 0.3 1
5 12 BE 0.3 2
6 12 BE 0.4 2
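
For what it's worth, the same merge reads a little cleaner with the name argument of reset_index (identical result):
counts = df.groupby(['row', 'name']).size().reset_index(name='Count')
df = df.merge(counts, on=['row', 'name'])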

Related

Change value of a specific column on dataframe subgroups in pandas based on condition

I have a dataframe similar to the one below:
A B C
1 0 0.0
1 2 0.2
1 3 1.0
2 1 0.2
2 4 0.0
2 6 1.0
3 1 0.4
3 2 1.0
3 0 0.9
3 3 0.0
Now, for each subgroup (the rows sharing an A value), I want to find the row with the minimum B value and change that row's C value to 0.5. In this case, I would obtain a new dataframe:
A B C
1 0 0.5
1 2 0.2
1 3 1.0
2 1 0.5
2 4 0.0
2 6 1.0
3 1 0.4
3 2 1.0
3 0 0.5
3 3 0.0
As an addendum, if this operation replaces a 0.0 or 1.0 in the C column, I'd like the row to be duplicated with its old value. In this case, the A=1 subgroup violates this rule (0.0 is replaced with 0.5) and should therefore produce:
A B C
1 0 0.0
1 0 0.5
1 2 0.2
1 3 1.0
...
The first problem is the main one; the second isn't a priority, but of course I'd welcome help with either.
Try:
df.loc[df.groupby('A')['B'].idxmin(), 'C'] = 0.5
Output:
A B C
0 1 0 0.5
1 1 2 0.2
2 1 3 1.0
3 2 1 0.5
4 2 4 0.0
5 2 6 1.0
6 3 1 0.4
7 3 2 1.0
8 3 0 0.5
9 3 3 0.0
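To see why this works: groupby('A')['B'].idxmin() returns the index label of each group's minimum-B row, and .loc assigns 0.5 at exactly those labels:
df.groupby('A')['B'].idxmin()
A
1    0
2    3
3    8
Name: B, dtype: int64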
For the addendum:
# minimum B rows
min_rows = df.groupby('A')['B'].idxmin()
# minimum B rows with C==0
zeros = df.loc[min_rows].loc[lambda x: x['C']==0].copy()
# change all min rows to 0.5
df.loc[min_rows, 'C'] = 0.5
# append the preserved rows with their original C values
df = pd.concat([df, zeros])
Output (notice the last row):
A B C
0 1 0 0.5
1 1 2 0.2
2 1 3 1.0
3 2 1 0.5
4 2 4 0.0
5 2 6 1.0
6 3 1 0.4
7 3 2 1.0
8 3 0 0.5
9 3 3 0.0
0 1 0 0.0
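
Note the preserved row keeps its original index (0) and lands at the end of the frame. To reproduce the ordering shown in the expected output, putting the preserved rows first before a stable index sort works (an extra step, not part of the original answer):
df = pd.concat([zeros, df]).sort_index(kind='mergesort').reset_index(drop=True)  # mergesort is stable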

How to count rows when the value changes from a value greater than a threshold to 0

I have three columns in a dataframe, x1, X2, X3. I want to count the rows where the value changes from a value greater than 1 to 0. If the value before the 0 is less than 1, it should not be counted.
input df:
df1 = pd.DataFrame({'x1': [3, 4, 7, 0, 0, 0, 0, 20, 15, 16, 0, 0, 70],
                    'X2': [3, 4, 7, 0, 0, 0, 0, 20, 15, 16, 0, 0, 70],
                    'X3': [6, 3, 0.5, 0, 0, 0, 0, 20, 15, 16, 0, 0, 70]})
print(df1)
x1 X2 X3
0 3 3 6.0
1 4 4 3.0
2 7 7 0.5
3 0 0 0.0
4 0 0 0.0
5 0 0 0.0
6 0 0 0.0
7 20 20 20.0
8 15 15 15.0
9 16 16 16.0
10 0 0 0.0
11 0 0 0.0
12 70 70 70.0
Desired output
x1_count X2_count X3_count
0 6 6 2
The idea is to replace 0 with missing values and forward fill them, keep the filled values only at the original zero positions (everything else becomes NaN), compare them against 1 with greater-than, and count the True values with sum; the resulting Series is then converted to a one-row DataFrame with a transpose:
m = df1.eq(0)
df2 = (df1.mask(m)
          .ffill()
          .where(m)
          .gt(1)
          .sum()
          .add_suffix('_count')
          .to_frame()
          .T)
print(df2)
x1_count X2_count X3_count
0 6 6 2
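
Seen step by step with the df1 defined above (a sketch that prints the intermediate frame to make the chain visible):
m = df1.eq(0)                  # True where the value is 0
filled = df1.mask(m).ffill()   # zeros -> NaN, then carry the last nonzero value forward
kept = filled.where(m)         # keep the carried value only at the original zero positions
print(kept)                    # rows 3-6 hold 7 / 7 / 0.5, rows 10-11 hold 16 / 16 / 16
print(kept.gt(1).sum())        # x1 6, X2 6, X3 2: zeros preceded by a value > 1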

Pandas: custom rank function based on quantile

I have the following data frame.
item_id price quantile
0 1 10 0.1
1 3 20 0.2
2 4 30 0.3
3 6 40 0.4
4 11 50 0.5
5 12 60 0.6
6 15 70 0.7
7 20 80 0.8
8 25 90 0.9
9 26 100 1.0
I would like to have a custom rank function that starts from the record whose quantile is closest to 0.44, then alternates outward: one step down, one step up, one step down, one step up...
The result should look like:
item_id price quantile customed_rank
0 1 10 0.1 6
1 3 20 0.2 4
2 4 30 0.3 2
3 6 40 0.4 1
4 11 50 0.5 3
5 12 60 0.6 5
6 15 70 0.7 7
7 20 80 0.8 8
8 25 90 0.9 9
9 26 100 1.0 10
Other than looping over the entire data frame, is there a more elegant way to achieve this? Thanks!
You want to rank by the absolute difference between quantile and 0.44. (Note this orders strictly by distance, so it can differ from the strictly alternating order in the expected output: 0.5 is closer to 0.44 than 0.3, so it ranks 2 here rather than 3.)
(df['quantile'] - 0.44).abs().rank()
0 7.0
1 5.0
2 3.0
3 1.0
4 2.0
5 4.0
6 6.0
7 8.0
8 9.0
9 10.0
Name: quantile, dtype: float64
A faster (but uglier) alternative is to argsort twice.
(df['quantile'] - 0.44).abs().values.argsort().argsort() + 1
array([ 7, 5, 3, 1, 2, 4, 6, 8, 9, 10])
Note that this alternative is only faster if you work with NumPy arrays (via the values property) rather than Pandas Series objects.
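To attach either result in the question's format (the int cast is cosmetic; rank returns floats):
df['customed_rank'] = (df['quantile'] - 0.44).abs().rank().astype(int)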

Python: Summing every five rows of column b data and create a new column

I have a dataframe like the one below. I would like to sum rows 0 to 4 (every 5 rows) and put the result in a new column ("new column"). My real dataframe has 263 rows, so within each repeating block the last group (rows 10-12) has only three rows to sum. How can I do this with Pandas/Python? I have started learning Python recently, so thanks for any advice in advance!
My data pattern is more complex because I am using the index as one of my columns and it repeats, like:
Row Data "new column"
0 5
1 1
2 3
3 3
4 2 14
5 4
6 8
7 1
8 2
9 1 16
10 0
11 2
12 3 5
0 3
1 1
2 2
3 3
4 2 11
5 2
6 6
7 2
8 2
9 1 13
10 1
11 0
12 1 2
...
259 50 89
260 1
261 4
262 5 10
I tried iterrows and groupby but couldn't make it work.
Use this:
df['new col'] = df.groupby(df.index // 5)['Data'].transform('sum')[lambda x: ~(x.duplicated(keep='last'))]
Output:
Data new col
0 5 NaN
1 1 NaN
2 3 NaN
3 3 NaN
4 2 14.0
5 4 NaN
6 8 NaN
7 1 NaN
8 2 NaN
9 1 16.0
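
One caveat with the duplicated trick: it keeps only the last occurrence of each distinct sum value across the whole column, so two groups sharing the same total would lose one of their entries. A sketch that marks each group's last row explicitly avoids that (same output here):
grp = df.index // 5
last = ~grp.duplicated(keep='last')   # True only at the final row of each group of 5
df['new col'] = df.groupby(grp)['Data'].transform('sum').where(last)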
Edit to handle updated question:
g = df.groupby(df.Row).cumcount()
df['new col'] = df.groupby([g, df.Row // 5])['Data']\
                  .transform('sum')[lambda x: ~x.duplicated(keep='last')]
Output:
Row Data new col
0 0 5 NaN
1 1 1 NaN
2 2 3 NaN
3 3 3 NaN
4 4 2 14.0
5 5 4 NaN
6 6 8 NaN
7 7 1 NaN
8 8 2 NaN
9 9 1 16.0
10 10 0 NaN
11 11 2 NaN
12 12 3 5.0
13 0 3 NaN
14 1 1 NaN
15 2 2 NaN
16 3 3 NaN
17 4 2 11.0
18 5 2 NaN
19 6 6 NaN
20 7 2 NaN
21 8 2 NaN
22 9 1 13.0
23 10 1 NaN
24 11 0 NaN
25 12 1 2.0
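
The same explicit last-row marking carries over to the two-level grouping, with cumcount(ascending=False) hitting 0 on each group's final row (again an alternative sketch, not the original answer):
g = df.groupby(df.Row).cumcount()                        # which repetition of the 0-12 block
key = [g, df.Row // 5]
last = df.groupby(key).cumcount(ascending=False).eq(0)   # True at each group's last row
df['new col'] = df.groupby(key)['Data'].transform('sum').where(last)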

How to concatenate every 5 lines if column 2 equals a certain value?

I have a file in which, whenever the second column of a line is 2, I want that line concatenated with the four lines that follow it. For example:
67 2
a b c
a b
0.1 0.2 0.3 0.4
0.3 0.9 0.7 0.1
09 3
b v c
5 6 7 8
78 2
p o p
q d
1.0 0.9 0.8 0.7
0.4 0.3 0.2 0.1
The output should be:
67 2 a b c a b 0.1 0.2 0.3 0.4 0.3 0.9 0.7 0.1
78 2 p o p q d 1.0 0.9 0.8 0.7 0.4 0.3 0.2 0.1
awk solution: on each line with 2 in its 2nd column, concatenate 5 lines (including the pattern line itself); the consumed lines are not scanned again:
awk '$2==2{i=4;tail=$0; while (i-- && (getline nl) > 0) { tail=tail FS nl } print tail}' file
The output:
67 2 a b c a b 0.1 0.2 0.3 0.4 0.3 0.9 0.7 0.1
78 2 p o p q d 1.0 0.9 0.8 0.7 0.4 0.3 0.2 0.1
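
For completeness, a rough Python equivalent of the awk one-liner ('file' is a placeholder path, as in the answer; whitespace within each block is normalized to single spaces):
with open('file') as fh:
    lines = iter(fh.read().splitlines())
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1] == '2':
            # the matching line plus the four lines that follow it
            block = [line] + [next(lines, '') for _ in range(4)]
            print(' '.join(' '.join(part.split()) for part in block))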
