Aggregate Total count - python-3.x

I want to merge two columns (Sender and Receiver) and get the count of each Transaction Type.
Sender Receiver Type Amount Date
773787639 777611388 1 300 2/1/2019
773631898 776806843 4 450 8/20/2019
773761571 777019819 6 369 2/11/2019
774295511 777084440 34 1000 1/22/2019
774263079 776816905 45 678 6/27/2019
774386894 777202863 12 2678 2/10/2019
773671537 777545555 14 38934 9/29/2019
774288117 777035194 18 21 4/22/2019
774242382 777132939 21 1275 9/30/2019
774144715 777049859 30 6309 7/4/2019
773911674 776938987 10 3528 5/1/2019
773397863 777548054 15 35892 7/6/2019
776816905 772345091 6 1234 7/7/2019
777035194 775623065 4 453454 7/20/2019
I am trying to get a table like this:
Sender/Receiver Type_1 Type_4 Type_12...... Type_45
773787639 3 2 0 0
773631898 1 0 1 2
773397863 2 2 0 0
772345091 1 1 0 3

You are looking for a pivot query. The only twist here is that we need to first take a union of the table to combine the sender/receiver data into a single column.
SELECT
SenderReceiver,
COUNT(CASE WHEN Type = 1 THEN 1 END) AS Type_1,
COUNT(CASE WHEN Type = 2 THEN 1 END) AS Type_2,
COUNT(CASE WHEN Type = 3 THEN 1 END) AS Type_3,
...
COUNT(CASE WHEN Type = 45 THEN 1 END) AS Type_45
FROM
(
SELECT Sender AS SenderReceiver, Type FROM yourTable
UNION ALL
SELECT Receiver, Type FROM yourTable
) t
GROUP BY
SenderReceiver;
If you don't want to type out 45 separate CASE expressions, you could probably automate it to some degree using Python.
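Since the question is tagged python-3.x, here is a minimal pandas sketch of the same pivot, assuming the data above is already loaded in a DataFrame named df with the columns shown:
import pandas as pd

# Stack Sender and Receiver into one column, keeping each row's Type
pairs = pd.concat([
    df[['Sender', 'Type']].rename(columns={'Sender': 'SenderReceiver'}),
    df[['Receiver', 'Type']].rename(columns={'Receiver': 'SenderReceiver'}),
])
# Count occurrences of each Type per SenderReceiver and label the columns
out = pd.crosstab(pairs['SenderReceiver'], pairs['Type']).add_prefix('Type_')
print(out)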

Related

Inner merge in python with tables having duplicate values in key column

I am struggling to replicate SAS's inner merge in Python. The pandas inner merge does not match the SAS inner merge when duplicate key values occur.
Below is an example:
import pandas as pd

zw = pd.DataFrame({"ID":[1,0,0,1,0,0,1],
"Name":['Shivansh','Shivansh','Shivansh','Amar','Arpit','Ranjeet','Priyanka'],
"job_profile":['DataS','SWD','DataA','DataA','AndroidD','PythonD','fullstac'],
"salary":[22,15,10,9,16,18,22],
"city":['noida','bangalore','hyderabad','noida','pune','gurugram','bangalore'],
"ant":[10,15,15,10,16,17,18]})
zw1 = pd.DataFrame({"ID-":[1,0,0,1,0,0,1],
"Name":['Shivansh','Shivansh','Swati','Amar','Arpit','Ranjeet','Priyanka'],
"job_profile_":['DataS','SWD','DataA','DataA','AndroidD','PythonD','fullstac'],
"salary_":[2,15,10,9,16,18,22],
"city_":['noida','kochi','hyderabad','noida','pune','gurugram','bangalore'],
"ant_":[1,15,15,10,16,17,18]})
zw and zw1 are the input tables. Both tables need to be inner merged on the key column Name. The issue is that both tables have duplicate values in the Name column.
Python generates all possible combinations of the duplicate rows.
Below is the expected output :
I tried a normal inner merge and tried dropping duplicate rows by the ID and Name columns, but still did not get the desired output.
df1 = pd.merge(zw, zw1, on=['Name'], how='inner')
df1.drop_duplicates(['Name', 'ID'])
Use a df.combine_first + df.sort_values combination:
df = zw.combine_first(zw1).sort_values('Name')
print(df)
ID ID- Name ant ant_ city city_ job_profile \
3 1 1 Amar 10 10 noida noida DataA
4 0 0 Arpit 16 16 pune pune AndroidD
6 1 1 Priyanka 18 18 bangalore bangalore fullstac
5 0 0 Ranjeet 17 17 gurugram gurugram PythonD
0 1 1 Shivansh 10 1 noida noida DataS
1 0 0 Shivansh 15 15 bangalore kochi SWD
2 0 0 Shivansh 15 15 hyderabad hyderabad DataA
job_profile_ salary salary_
3 DataA 9 9
4 AndroidD 16 16
6 fullstac 22 22
5 PythonD 18 18
0 DataS 22 2
1 SWD 15 15
2 DataA 10 10
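As an aside, a hedged alternative sketch (not what the answer above does): a SAS data-step merge pairs duplicate keys positionally, and that pairing can be approximated by numbering each occurrence of Name with groupby.cumcount and merging on both Name and the counter:
zw['n'] = zw.groupby('Name').cumcount()
zw1['n'] = zw1.groupby('Name').cumcount()
out = pd.merge(zw, zw1, on=['Name', 'n'], how='inner').drop(columns='n')
This matches the first Shivansh row of zw with the first Shivansh row of zw1, the second with the second, and so on, rather than producing every combination.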

Pandas dataframe: Count no of rows which meet a set of conditions across multiple columns [duplicate]

I have a dataframe (edata) as given below:
Domestic Catsize Type Count
1 0 1 1
1 1 1 8
1 0 2 11
0 1 3 14
1 1 4 21
0 1 4 31
From this dataframe I want to calculate the sum of all counts where the logical AND of the two variables (Domestic and Catsize) is zero (0), i.e. for these combinations:
Domestic Catsize AND
1 0 0
0 1 0
0 0 0
The code I use to perform the process is
g = edata.groupby('Type')
q3=g.apply(lambda x:x[((x['Domestic']==0) & (x['Catsize']==0) |
(x['Domestic']==0) & (x['Catsize']==1) |
(x['Domestic']==1) & (x['Catsize']==0)
)]
['Count'].sum()
)
q3
Type
1 1
2 11
3 14
4 31
This code works fine; however, if the number of variables in the dataframe increases, the number of conditions grows rapidly. So, is there a smart way to write a condition stating that if ANDing the two (or more) variables results in a zero, then perform the sum()?
You can filter first using pd.DataFrame.all, negated:
cols = ['Domestic', 'Catsize']
res = df[~df[cols].all(1)].groupby('Type')['Count'].sum()
print(res)
# Type
# 1 1
# 2 11
# 3 14
# 4 31
# Name: Count, dtype: int64
Use np.logical_and.reduce to generalise.
columns = ['Domestic', 'Catsize']
df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
Type
1 1
2 11
3 14
4 31
Name: Count, dtype: int64
Before adding it back, use map to broadcast:
u = df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
df['NewCol'] = df.Type.map(u)
df
Domestic Catsize Type Count NewCol
0 1 0 1 1 1
1 1 1 1 8 1
2 1 0 2 11 11
3 0 1 3 14 14
4 1 1 4 21 31
5 0 1 4 31 31
How about:
columns = ['Domestic', 'Catsize']
df.loc[~df[columns].prod(axis=1).astype(bool), 'Count']
Then do with it whatever you want.
For logical AND, the product does the trick nicely.
For logical OR, you can use sum(axis=1), with proper negation in advance.
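To spell that out, a small sketch using the frame above (the OR mask is one reading of "sum with proper negation", which the answer does not spell out):
columns = ['Domestic', 'Catsize']
# AND of the flags: the product is 1 only when every flag is 1
mask_and = df[columns].prod(axis=1).astype(bool)
# OR of the flags: the sum is nonzero when at least one flag is 1
mask_or = df[columns].sum(axis=1).astype(bool)
df.loc[~mask_and, 'Count'].sum()   # rows whose AND is zero, as in the question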

Excel - Lookup the group based on value range per segment

I have a table like below.
segmentnum group 1 group 2 group 3 group 4
1 0 12 33 66
2 0 3 10 26
3 0 422 1433 3330
And a table like below.
vol segmentnum
0 1
58 1
66 1
48 1
9 2
13 2
7 2
10 3
1500 3
I'd like to add a column that tells me which group the vol for a given segmentnum belongs to, such that:
Group 1 = x to < group 2
Group 2 = x to < group 3
Group 3 = x to <= group 4
Desired result:
vol segmentnum group
0 1 1
58 1 3
66 1 3
48 1 3
9 2 2
13 2 3
7 2 2
10 3 3
1500 3 3
Per the accompanying image, put this in I2 and drag down.
=MATCH(G2, INDEX(B$2:E$4, MATCH(H2, A$2:A$4, 0), 0))
While these results differ from yours, I believe they are correct.
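For what it's worth, a sketch of how the formula works (the cell references are assumptions from the image: G2 holds vol, H2 holds segmentnum): the inner MATCH(H2, A$2:A$4, 0) finds the row for the segment, INDEX(B$2:E$4, row, 0) returns that entire row of group thresholds, and the outer MATCH, in its default approximate mode, returns the position of the largest threshold less than or equal to the vol; that position is the group number. For example, vol 58 in segmentnum 1 scans the thresholds 0, 12, 33, 66; the largest value not exceeding 58 is 33, at position 3, hence group 3. Note that approximate MATCH assumes each threshold row is sorted ascending.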

Taking all duplicate values in column as single value in pandas

My current dataframe is:
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
I want to get a dataframe as:
Name term Grade
0 A 1 35
2 40
1 B 1 50
2 45
Is it possible to get my expected output? If yes, how can I do it?
Use duplicated for a boolean mask with numpy.where:
import numpy as np

mask = df['Name'].duplicated()
# more general
# mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print(df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
The difference between the masks can be seen in a modified DataFrame:
print(df)
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
4 A 4 43
5 A 3 46
If the same value appears in multiple separate consecutive groups, like the two A groups here, the general solution is needed:
mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print(df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
4 A 4 43
5 3 46
mask = df['Name'].duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print(df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
4 4 43
5 3 46

How does stream analytics query windowing work and how do I avoid double rows?

I have a query in Stream Analytics to count the requests per hour. I therefore have a GROUP BY clause, and I got the message "a window is expected". So I added a tumbling window, but I do not understand how it works.
group by
datetimefromparts(year(r.starttime), month(r.starttime), day(r.starttime), datepart(hh, r.starttime),0,0,0),
r.tags,
r.IsMobile,
tumblingwindow(hh, 1)
When I choose hh I got these 11 results:
requestdate tags ismobile count summary
2016-10-03T21:00:00.000Z A 0 6 2016-10-03T21:00:00.000ZA06
2016-10-03T21:00:00.000Z B 0 1 2016-10-03T21:00:00.000ZB01
2016-10-03T22:00:00.000Z A 0 20 2016-10-03T22:00:00.000ZA020
2016-10-03T21:00:00.000Z B 0 1 2016-10-03T21:00:00.000ZB01
2016-10-03T22:00:00.000Z B 0 14 2016-10-03T22:00:00.000ZB014
2016-10-03T21:00:00.000Z A 1 2 2016-10-03T21:00:00.000ZA12
2016-10-03T21:00:00.000Z B 1 1 2016-10-03T21:00:00.000ZB11
2016-10-03T21:00:00.000Z B 1 1 2016-10-03T21:00:00.000ZB11
2016-10-03T21:00:00.000Z A 1 1 2016-10-03T21:00:00.000ZA11
2016-10-03T22:00:00.000Z A 1 15 2016-10-03T22:00:00.000ZA115
2016-10-03T22:00:00.000Z B 1 22 2016-10-03T22:00:00.000ZB122
But when I choose dd (tumblingwindow(dd, 1)) I got these results:
requestdate tags ismobile count summary
2016-10-16T21:00:00.0000000Z B 1 45 2016-10-16T21:00:00.0000000ZB145
2016-10-16T22:00:00.0000000Z A 0 51 2016-10-16T22:00:00.0000000ZA051
2016-10-16T22:00:00.0000000Z A 1 49 2016-10-16T22:00:00.0000000ZA149
2016-10-16T22:00:00.0000000Z B 0 41 2016-10-16T22:00:00.0000000ZB041
2016-10-16T22:00:00.0000000Z B 1 39 2016-10-16T22:00:00.0000000ZB139
2016-10-16T23:00:00.0000000Z A 0 3 2016-10-16T23:00:00.0000000ZA03
2016-10-16T23:00:00.0000000Z A 0 39 2016-10-16T23:00:00.0000000ZA039
2016-10-16T23:00:00.0000000Z A 1 2 2016-10-16T23:00:00.0000000ZA12
2016-10-16T23:00:00.0000000Z A 1 38 2016-10-16T23:00:00.0000000ZA138
2016-10-16T23:00:00.0000000Z B 0 1 2016-10-16T23:00:00.0000000ZB01
2016-10-16T23:00:00.0000000Z B 0 46 2016-10-16T23:00:00.0000000ZB046
2016-10-16T23:00:00.0000000Z B 1 29 2016-10-16T23:00:00.0000000ZB129
2016-10-16T23:00:00.0000000Z B 1 4 2016-10-16T23:00:00.0000000ZB14
2016-10-17T00:00:00.0000000Z A 0 42 2016-10-17T00:00:00.0000000ZA042
2016-10-17T00:00:00.0000000Z A 1 36 2016-10-17T00:00:00.0000000ZA136
2016-10-17T00:00:00.0000000Z B 0 39 2016-10-17T00:00:00.0000000ZB039
2016-10-17T00:00:00.0000000Z B 1 45 2016-10-17T00:00:00.0000000ZB145
2016-10-17T01:00:00.0000000Z A 0 41 2016-10-17T01:00:00.0000000ZA041
When I run my job for several days, I get 8 rows for 23:00:00h every day, but I am expecting only 4 rows for every hour. How can I solve that? Can someone explain to me how this works?
I believe it's due to your Tags always having both A and B. This looks like expected behavior, since every hour you have an A and a B. You will get the same thing with IsMobile when there is more than one distinct value in the hourly group.
So, with Tags and IsMobile each having two possible values every hour, you can get up to four rows per hour if the following events occur in that hour:
Tag: A, IsMobile: 0
Tag: A, IsMobile: 1
Tag: B, IsMobile: 0
Tag: B, IsMobile: 1
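For illustration only, a small pandas sketch of the same grouping logic, with a hypothetical events DataFrame holding datetime starttime, tags, and IsMobile columns (an assumption, not part of the answer):
import pandas as pd

# One output row per (hour, tags, IsMobile) combination, so two tag
# values times two IsMobile values gives at most 2 * 2 = 4 rows per hour
events['hour'] = events['starttime'].dt.floor('h')
counts = events.groupby(['hour', 'tags', 'IsMobile']).size()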
