PivotTable with multiple conditions to count unique items - excel

The following is a portion of my table in Excel:
A B C D E
5 10 1 18316 3
5 11 1 18313 3
5 11 2 18002 3
5 11 3 10825 3
5 12 1 18316 3
5 12 2 18001 3
5 12 3 10825 3
5 13 1 18313 3
5 13 2 18002 3
5 14 1 18316 3
5 14 2 18001 3
5 14 3 18002 3
5 15 1 18313 3
5 16 1 18316 3
5 16 2 18002 3
5 16 3 18313 3
5 17 1 18313 3
5 17 2 18002 3
5 17 3 18316 3
5 20 1 18313 3
5 21 1 18316 3
5 21 2 18001 3
5 21 3 18313 3
15 10 1 47009 3
15 10 2 40802 3
15 11 1 47009 3
15 12 1 47010 3
15 12 2 47009 3
15 13 1 47009 3
15 13 2 47010 3
15 14 1 47010 3
What I want to achieve is the following:
I want to calculate the count of a given number in column D for every unique combination of B and A, with respect to C (whether D is at the Max of C or not).
Output something like:
Filter: 18001 on Column D
5
12 1 Non-Max
14 1 Non-Max
21 1 Non-Max
Similarly if the filter is changed to 18316:
5
10 1 Max
12 1 Non-Max
14 1 Non-Max
16 1 Non-Max
17 1 Max
21 1 Non-Max
I have 20K rows of data that need processing.

I seem to be able to achieve close to the results you indicate from the data you have provided - but have no idea what you mean by "for every unique B and A with respect to C (if D is at the Max of C or not)". I applied a PivotTable as below:
Max and Non-Max are indicated by the relationship between Count of E and Max of C, which could be used in a simple formula to display Max or Non-Max outside the PivotTable.
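For readers who want to check the expected output outside Excel, here is a minimal pandas sketch of one possible reading of the question. It assumes that "Max" means the chosen D value sits on the row with the largest C within its (A, B) group. That reading is an assumption, although it reproduces both sample outputs. The function name count_with_max_flag and the column names Position and Count are hypothetical, and only the A = 5 rows of the sample are included.

```python
import pandas as pd

# (B, C, D) rows for A == 5, copied from the sample table in the question.
bcd = [
    (10, 1, 18316),
    (11, 1, 18313), (11, 2, 18002), (11, 3, 10825),
    (12, 1, 18316), (12, 2, 18001), (12, 3, 10825),
    (13, 1, 18313), (13, 2, 18002),
    (14, 1, 18316), (14, 2, 18001), (14, 3, 18002),
    (15, 1, 18313),
    (16, 1, 18316), (16, 2, 18002), (16, 3, 18313),
    (17, 1, 18313), (17, 2, 18002), (17, 3, 18316),
    (20, 1, 18313),
    (21, 1, 18316), (21, 2, 18001), (21, 3, 18313),
]
df = pd.DataFrame(bcd, columns=["B", "C", "D"])
df.insert(0, "A", 5)


def count_with_max_flag(frame, d_value):
    """Count rows with D == d_value per (A, B), flagged Max or Non-Max.

    Assumption: a row is "Max" when its C equals the largest C
    within the same (A, B) group.
    """
    # Largest C in each (A, B) group, broadcast back onto every row.
    max_c = frame.groupby(["A", "B"])["C"].transform("max")
    position = (frame["C"] == max_c).map({True: "Max", False: "Non-Max"})
    hits = frame.assign(Position=position)
    hits = hits[hits["D"] == d_value]
    return hits.groupby(["A", "B", "Position"]).size().reset_index(name="Count")


result_18316 = count_with_max_flag(df, 18316)
result_18001 = count_with_max_flag(df, 18001)
print(result_18316)
print(result_18001)
```

Because the flag is computed before filtering on D, the same function can be reused for any filter value.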

how to split data in groups by two column conditions pandas

I have a dataframe that I want to split into groups based on a condition on the flag_0 and flag_1 columns: each group is a run of consecutive rows where flag_0 is 3 and flag_1 is 1.
Here is my dataframe example:
import pandas as pd
df=pd.DataFrame({'flag_0':[1,2,3,1,2,3,1,2,3,3,3,3,1,2,3,1,2,3,4,4],'flag_1':[1,2,3,1,2,3,1,2,1,1,1,1,1,2,1,1,2,3,4,4],'dd':[1,1,1,7,7,7,8,8,8,1,1,1,7,7,7,8,8,8,5,7]})
Out[172]:
flag_0 flag_1 dd
0 1 1 1
1 2 2 1
2 3 3 1
3 1 1 7
4 2 2 7
5 3 3 7
6 1 1 8
7 2 2 8
8 3 1 8
9 3 1 1
10 3 1 1
11 3 1 1
12 1 1 7
13 2 2 7
14 3 1 7
15 1 1 8
16 2 2 8
17 3 3 8
18 4 4 5
19 4 4 7
Desired output:
group_1
Out[172]:
flag_0 flag_1 dd
9 3 1 1
10 3 1 1
11 3 1 1
group_2
Out[172]:
flag_0 flag_1 dd
14 3 1 7
You can use a mask and groupby to split the dataframe:
cond = {'flag_0': 3, 'flag_1': 1}
mask = df[list(cond)].eq(cond).all(axis=1)
groups = [g for k,g in df[mask].groupby((~mask).cumsum())]
output:
[ flag_0 flag_1 dd
8 3 1 8
9 3 1 1
10 3 1 1
11 3 1 1,
flag_0 flag_1 dd
14 3 1 7]
groups[0]
flag_0 flag_1 dd
8 3 1 8
9 3 1 1
10 3 1 1
11 3 1 1
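To see why the grouping key works, here is a self-contained sketch of the same idea with the intermediate key spelled out. The variable names key and group_indexes are illustrative only. The running count of non-matching rows stays constant across a run of matching rows, so each run gets its own key value.

```python
import pandas as pd

df = pd.DataFrame({
    "flag_0": [1, 2, 3, 1, 2, 3, 1, 2, 3, 3, 3, 3, 1, 2, 3, 1, 2, 3, 4, 4],
    "flag_1": [1, 2, 3, 1, 2, 3, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 3, 4, 4],
    "dd":     [1, 1, 1, 7, 7, 7, 8, 8, 8, 1, 1, 1, 7, 7, 7, 8, 8, 8, 5, 7],
})

# True on rows that satisfy both conditions.
mask = (df["flag_0"] == 3) & (df["flag_1"] == 1)

# Counter that increases on every non-matching row and therefore stays
# constant inside each consecutive run of matching rows.
key = (~mask).cumsum()

# Keep only matching rows, then split them by the run they belong to.
groups = [g for _, g in df[mask].groupby(key[mask])]

group_indexes = [g.index.tolist() for g in groups]
print(group_indexes)
```

Note that row 8 also satisfies the condition, so it lands in the first group together with rows 9 to 11.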

Sum two dataframes for equal entries

I have two dataframes with same entries in column A, but different entries in columns B and C.
One dataframe has multiple entries for one entry in A.
df1
A B C
0 this 3 4
1 is 4 6
2 an 7 9
3 example 12 20
df2
A B C
0 this 11 11
1 this 5 9
2 this 18 7
3 is 12 14
4 an 1 4
5 an 8 12
6 example 3 17
7 example 9 5
8 example 19 6
9 example 7 1
I want to sum the two dataframes for the same entries in column A. The result should look like this:
df3
A B C
0 this 14 15
1 this 8 13
2 this 21 11
3 is 16 20
4 an 8 13
5 an 15 21
6 example 15 37
7 example 21 25
8 example 31 26
9 example 19 21
How can I calculate this in a fast way in pandas?
Use DataFrame.merge to left merge the dataframe df2 with df1 on column A then add the columns B, C of df2 to the columns B, C of df3:
df3 = df2[['A']].merge(df1, on='A', how='left')
df3[['B', 'C']] += df2[['B', 'C']]
Result:
print(df3)
A B C
0 this 14 15
1 this 8 13
2 this 21 11
3 is 16 20
4 an 8 13
5 an 15 21
6 example 15 37
7 example 21 25
8 example 31 26
9 example 19 21
OR another possible idea if order is not important:
df3 = df2.set_index('A').add(df1.set_index('A')).reset_index()
print(df3)
A B C
0 an 8 13
1 an 15 21
2 example 15 37
3 example 21 25
4 example 31 26
5 example 19 21
6 is 16 20
7 this 14 15
8 this 8 13
9 this 21 11
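For completeness, here is a self-contained sketch of the merge approach that can be run as-is on the sample data. The validate="many_to_one" argument is an addition not present in the answer above. It is assumed to be useful here because it makes pandas raise an error if df1 ever contains duplicate entries in A, which would otherwise silently multiply rows.

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": ["this", "is", "an", "example"],
    "B": [3, 4, 7, 12],
    "C": [4, 6, 9, 20],
})
df2 = pd.DataFrame({
    "A": ["this", "this", "this", "is", "an", "an",
          "example", "example", "example", "example"],
    "B": [11, 5, 18, 12, 1, 8, 3, 9, 19, 7],
    "C": [11, 9, 7, 14, 4, 12, 17, 5, 6, 1],
})

# Look up the df1 row for every row of df2, keeping df2's row order.
# validate="many_to_one" checks that each A appears at most once in df1.
df3 = df2[["A"]].merge(df1, on="A", how="left", validate="many_to_one")

# Both frames now share the same 0..9 index, so the addition is row by row.
df3[["B", "C"]] += df2[["B", "C"]]
print(df3)
```

If some value of A in df2 were missing from df1, the left merge would leave NaN in B and C for those rows, so the sums there would also be NaN.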

How to write Python code that does cumprod for forward 2 periods with groupby

I want to calculate a return, RET, which is the cumulative product over 2 periods (the current and the next period), grouped by id. This is the code I used:
df['RET'] = df.groupby('id')['trt1m1'].rolling(2,min_periods=2).apply(lambda x:x.prod()).reset_index(0,drop=True)
Expected Result:
id datadate trt1m1 RET
1 20051231 1 2
1 20060131 2 6
1 20060228 3 12
1 20060331 4 16
1 20060430 4 20
1 20060531 5 NaN
2 20061031 10 110
2 20061130 11 165
2 20061231 15 300
2 20070131 20 420
2 20070228 21 NaN
Actual Result:
id datadate trt1m1 RET
1 20051231 1 NaN
1 20060131 2 2
1 20060228 3 6
1 20060331 4 12
1 20060430 4 16
1 20060531 5 20
2 20061031 10 NaN
2 20061130 11 110
2 20061231 15 165
2 20070131 20 300
2 20070228 21 420
The code I used calculates the cumulative product over the trailing 2 periods instead of the forward 2 periods.
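One possible way to get the forward-looking product, sketched under the assumption that rows are ordered by datadate within each id: multiply each value by the next value in the same group, obtained with a grouped shift(-1). The variable name next_value is illustrative. The last row of each group has no next period, so it becomes NaN, matching the expected result.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "datadate": [20051231, 20060131, 20060228, 20060331, 20060430, 20060531,
                 20061031, 20061130, 20061231, 20070131, 20070228],
    "trt1m1": [1, 2, 3, 4, 4, 5, 10, 11, 15, 20, 21],
})

# Make sure periods are in chronological order inside each id.
df = df.sort_values(["id", "datadate"]).reset_index(drop=True)

# Value of the next period within the same id (NaN on each group's last row).
next_value = df.groupby("id")["trt1m1"].shift(-1)

# Product of the current and the next period.
df["RET"] = df["trt1m1"] * next_value
print(df)
```

For a window of n periods, the same effect could be had by keeping the trailing rolling product and shifting it back by n - 1 rows within each group, since a forward window starting at row t covers the same rows as a trailing window ending at row t + n - 1.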

Variance-covariance matrix with multiple columns

I have the following data:
at_score atp_1 atp_2 atp_3 g_date g_id g_time ht_diff ht_score htp_1 htp_2 htp_3
0 0 6 7 8 11/16/18 1 0 0 0 1 2 3
1 13 6 7 9 11/16/18 1 15 2 15 1 2 3
2 20 7 8 10 11/16/18 1 18 2 22 3 4 5
3 40 7 8 6 11/16/18 1 33 5 45 4 1 2
4 65 8 7 6 11/16/18 1 60 -3 62 1 2 3
5 0 6 7 8 11/20/18 2 0 0 0 1 2 3
6 10 9 7 8 11/20/18 2 7 -4 6 4 2 3
7 26 6 10 7 11/20/18 2 24 -1 25 1 5 4
8 40 9 7 8 11/20/18 2 42 5 45 1 2 5
9 65 6 7 10 11/20/18 2 60 5 70 1 5 2
where at_score, ht_score are the away & home team's score on a particular date (g_date), in a particular game (g_id), & at a particular time in the game (g_time). ht_diff represents the home team's score differential (ht_score - at_score). Finally, and for my purposes most importantly, atp_1, atp_2, atp_3 are the 3 away players who are playing at that point. htp_1, htp_2, htp_3 are their home team counterparts.
What I'd like to calculate is the variance-covariance matrix for each of the home & away team players based on how the ht_diff, ht_score & at_score changed while they were playing and the players they were playing with. For example away player 6 played with players 7 & 8 for the first 13 minutes of g_id 1 (ht_diff = 2 for this period) & the last 27 minutes (ht_diff = -3).
In the end I have about 2.5 million observations (as well as 10 players playing at a time), so finding an 'easy' way to calculate this would be extremely helpful.

How to find the number of duplicate lines, where each line contains a few numbers separated by spaces

Suppose I have a file like this:
4 2 8 2 12 3 18 2 22 2 26 2 28 3 30 2
4 3 10 2 14 2 18 2 20 3 22 2 28 2 32 2
2 3 10 3 12 2 16 2 18 3 20 2 24 2 26 3
1 3 3 3 17 3 19 3 26 2 28 2 30 2 32 2
4 2 8 2 12 3 18 2 22 2 26 2 28 3 30 2
The first and the last lines are the same in the input.
I want the output to be like this:
4 2 8 2 12 3 18 2 22 2 26 2 28 3 30 2 2
4 3 10 2 14 2 18 2 20 3 22 2 28 2 32 2 1
2 3 10 3 12 2 16 2 18 3 20 2 24 2 26 3 1
1 3 3 3 17 3 19 3 26 2 28 2 30 2 32 2 1
The extra last column in the output simply specifies how many times each line occurs in the input.
How can I do this in bash?
I know the sort command, but it seems to work only with one number per line.
Coming from sehe's suggestion, what about this?
sort your_file | uniq -c | awk '{for(i=2;i<=NF;i++) printf "%s\t", $i; print $1}'
Output:
1 3 3 3 17 3 19 3 26 2 28 2 30 2 32 2 1
2 3 10 3 12 2 16 2 18 3 20 2 24 2 26 3 1
4 2 8 2 12 3 18 2 22 2 26 2 28 3 30 2 2
4 3 10 2 14 2 18 2 20 3 22 2 28 2 32 2 1
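Note that the sort-based pipeline reorders the lines. If the original order matters, as in the output requested in the question, a small Python sketch using collections.Counter is one alternative. It is not part of the answer above, and the function name count_duplicate_lines is hypothetical. Counter keeps keys in first-seen order on Python 3.7 and later.

```python
from collections import Counter


def count_duplicate_lines(lines):
    """Return each distinct line once, in first-seen order, with its count appended."""
    # Normalise runs of whitespace so "4  2" and "4 2" count as the same line,
    # and skip empty lines.
    normalised = (" ".join(line.split()) for line in lines)
    counts = Counter(line for line in normalised if line)
    return [f"{line} {n}" for line, n in counts.items()]


sample = [
    "4 2 8 2 12 3 18 2 22 2 26 2 28 3 30 2",
    "4 3 10 2 14 2 18 2 20 3 22 2 28 2 32 2",
    "2 3 10 3 12 2 16 2 18 3 20 2 24 2 26 3",
    "1 3 3 3 17 3 19 3 26 2 28 2 30 2 32 2",
    "4 2 8 2 12 3 18 2 22 2 26 2 28 3 30 2",
]

for row in count_duplicate_lines(sample):
    print(row)
```

To run it on a file, the list could be replaced by an open file object, since iterating over a file yields its lines.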
