I am struggling to replicate a SAS (another programming language) inner merge in Python. The Python inner merge does not match the SAS inner merge when duplicate key values are present. Below is an example:
import pandas as pd

zw = pd.DataFrame({"ID": [1, 0, 0, 1, 0, 0, 1],
                   "Name": ['Shivansh', 'Shivansh', 'Shivansh', 'Amar', 'Arpit', 'Ranjeet', 'Priyanka'],
                   "job_profile": ['DataS', 'SWD', 'DataA', 'DataA', 'AndroidD', 'PythonD', 'fullstac'],
                   "salary": [22, 15, 10, 9, 16, 18, 22],
                   "city": ['noida', 'bangalore', 'hyderabad', 'noida', 'pune', 'gurugram', 'bangalore'],
                   "ant": [10, 15, 15, 10, 16, 17, 18]})
zw1 = pd.DataFrame({"ID-": [1, 0, 0, 1, 0, 0, 1],
                    "Name": ['Shivansh', 'Shivansh', 'Swati', 'Amar', 'Arpit', 'Ranjeet', 'Priyanka'],
                    "job_profile_": ['DataS', 'SWD', 'DataA', 'DataA', 'AndroidD', 'PythonD', 'fullstac'],
                    "salary_": [2, 15, 10, 9, 16, 18, 22],
                    "city_": ['noida', 'kochi', 'hyderabad', 'noida', 'pune', 'gurugram', 'bangalore'],
                    "ant_": [1, 15, 15, 10, 16, 17, 18]})
zw and zw1 are the input tables. Both tables need to be inner merged on the key column Name. The issue is that both tables have duplicate values in the Name column, and Python generates all possible combinations of the duplicate rows, whereas SAS pairs duplicates row by row within each BY group, which is the output I expect. I tried a normal inner merge and tried dropping duplicate rows on the ID and Name columns, but I still do not get the desired output:
df1 = pd.merge(zw, zw1, on=['Name'], how='inner')
df1 = df1.drop_duplicates(['Name', 'ID'])  # note: drop_duplicates returns a new frame
Use a combination of DataFrame.combine_first and DataFrame.sort_values:
df = zw.combine_first(zw1).sort_values('Name')
print(df)
ID ID- Name ant ant_ city city_ job_profile \
3 1 1 Amar 10 10 noida noida DataA
4 0 0 Arpit 16 16 pune pune AndroidD
6 1 1 Priyanka 18 18 bangalore bangalore fullstac
5 0 0 Ranjeet 17 17 gurugram gurugram PythonD
0 1 1 Shivansh 10 1 noida noida DataS
1 0 0 Shivansh 15 15 bangalore kochi SWD
2 0 0 Shivansh 15 15 hyderabad hyderabad DataA
job_profile_ salary salary_
3 DataA 9 9
4 AndroidD 16 16
6 fullstac 22 22
5 PythonD 18 18
0 DataS 22 2
1 SWD 15 15
2 DataA 10 10
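For context, SAS's MERGE with a BY statement pairs duplicate keys positionally within each BY group instead of forming a Cartesian product. A minimal sketch of emulating that pairing explicitly (an alternative approach, not what combine_first does) is to number each occurrence of a Name with groupby().cumcount() and merge on the counter as a secondary key:

# Emulate SAS's within-group row pairing: the k-th Shivansh in zw
# only matches the k-th Shivansh in zw1.
zw['_n'] = zw.groupby('Name').cumcount()
zw1['_n'] = zw1.groupby('Name').cumcount()
paired = pd.merge(zw, zw1, on=['Name', '_n'], how='inner').drop(columns='_n')

Note that this drops the third Shivansh row, since zw1 only has two, whereas combine_first above aligns rows purely by index and therefore keeps all seven.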
I have a dataframe(edata) as given below
Domestic Catsize Type Count
1 0 1 1
1 1 1 8
1 0 2 11
0 1 3 14
1 1 4 21
0 1 4 31
From this dataframe I want to calculate the sum of all counts where the logical AND of the two variables (Domestic and Catsize) results in zero (0), i.e. the rows matching:

Domestic Catsize AND
1        0       0
0        1       0
0        0       0
The code I use to perform the process is
g=edata.groupby('Type')
q3=g.apply(lambda x:x[((x['Domestic']==0) & (x['Catsize']==0) |
(x['Domestic']==0) & (x['Catsize']==1) |
(x['Domestic']==1) & (x['Catsize']==0)
)]
['Count'].sum()
)
q3
Type
1 1
2 11
3 14
4 31
This code works fine; however, if the number of variables in the dataframe increases, the number of conditions grows rapidly. So, is there a smart way to write a condition that says: if ANDing the two (or more) variables results in a zero, then perform the sum()?
You can filter first using pd.DataFrame.all, negated:
cols = ['Domestic', 'Catsize']
res = df[~df[cols].all(axis=1)].groupby('Type')['Count'].sum()
print(res)
# Type
# 1 1
# 2 11
# 3 14
# 4 31
# Name: Count, dtype: int64
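For completeness, a minimal self-contained version (reconstructing the question's edata frame from the table above):

import pandas as pd

edata = pd.DataFrame({
    'Domestic': [1, 1, 1, 0, 1, 0],
    'Catsize':  [0, 1, 0, 1, 1, 1],
    'Type':     [1, 1, 2, 3, 4, 4],
    'Count':    [1, 8, 11, 14, 21, 31],
})

cols = ['Domestic', 'Catsize']
# keep rows where NOT every flag is 1, i.e. the row-wise AND is 0
res = edata[~edata[cols].all(axis=1)].groupby('Type')['Count'].sum()
print(res)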
Use np.logical_and.reduce to generalise.
columns = ['Domestic', 'Catsize']
df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
Type
1 1
2 11
3 14
4 31
Name: Count, dtype: int64
To add the result back to the original dataframe, use map to broadcast it by Type:
u = df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
df['NewCol'] = df.Type.map(u)
df
Domestic Catsize Type Count NewCol
0 1 0 1 1 1
1 1 1 1 8 1
2 1 0 2 11 11
3 0 1 3 14 14
4 1 1 4 21 31
5 0 1 4 31 31
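A hedged one-step alternative that fills NewCol directly is to NaN-out the excluded rows with Series.where and let groupby(...).transform('sum') broadcast the per-Type sums (the column comes back as float because of the NaNs):

import numpy as np

m = np.logical_and.reduce(df[columns], axis=1)
# rows where the AND is 1 become NaN and are skipped by the sum
df['NewCol'] = df['Count'].where(~m).groupby(df['Type']).transform('sum')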
How about:
columns = ['Domestic', 'Catsize']
df.loc[~df[columns].prod(axis=1).astype(bool), 'Count']
and then do with it whatever you want. For logical AND, the product does the trick nicely; for logical OR, you can use sum(axis=1) with proper negation in advance.
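To make that concrete, a short sketch using the edata frame reconstructed earlier (the OR mask is purely illustrative):

columns = ['Domestic', 'Catsize']
# AND: the product of 0/1 flags is 1 only when every flag is 1
and_mask = edata[columns].prod(axis=1).astype(bool)
# OR: the sum of 0/1 flags is nonzero when at least one flag is 1
or_mask = edata[columns].sum(axis=1).astype(bool)
print(edata.loc[~and_mask].groupby('Type')['Count'].sum())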
I have a table like below.
segmentnum group 1 group 2 group 3 group 4
1 0 12 33 66
2 0 3 10 26
3 0 422 1433 3330
And a table like below.
vol segmentnum
0 1
58 1
66 1
48 1
9 2
13 2
7 2
10 3
1500 3
I'd like to add a column that tells me which group the vol for a given segmentnum belongs to, such that:

Group 1 = from the group 1 value up to, but not including, the group 2 value
Group 2 = from the group 2 value up to, but not including, the group 3 value
Group 3 = from the group 3 value up to and including the group 4 value
Desired result:
vol segmentnum group
0 1 1
58 1 3
66 1 3
48 1 3
9 2 2
13 2 3
7 2 2
10 3 3
1500 3 3
Per the accompanying image, put this in I2 and drag down.
=MATCH(G2, INDEX(B$2:E$4, MATCH(H2, A$2:A$4, 0), 0))
While these results differ from yours, I believe they are correct.
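For a pandas version of the same lookup, np.searchsorted with side='right' mirrors MATCH's approximate-match behavior of finding the last threshold less than or equal to vol (a sketch; the table names here are reconstructed from the question):

import numpy as np
import pandas as pd

groups = pd.DataFrame(
    {'group 1': [0, 0, 0], 'group 2': [12, 3, 422],
     'group 3': [33, 10, 1433], 'group 4': [66, 26, 3330]},
    index=pd.Index([1, 2, 3], name='segmentnum'))

vols = pd.DataFrame({'vol': [0, 58, 66, 48, 9, 13, 7, 10, 1500],
                     'segmentnum': [1, 1, 1, 1, 2, 2, 2, 3, 3]})

def find_group(row):
    bounds = groups.loc[row['segmentnum']].to_numpy()
    # position of the last threshold <= vol, 1-based like MATCH
    return int(np.searchsorted(bounds, row['vol'], side='right'))

vols['group'] = vols.apply(find_group, axis=1)

Like the MATCH formula, this puts vol 66 in group 4 and vol 10 of segment 3 in group 1, matching the caveat above that the results differ from the question's expected output.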
My current dataframe is:
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
I want to get a dataframe as:
Name term Grade
0 A 1 35
2 40
1 B 1 50
2 45
Is it possible to get my expected output? If yes, how can I do it?
Use duplicated to build a boolean mask, combined with numpy.where:
mask = df['Name'].duplicated()
#more general
#mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
The difference between the two masks can be seen with a modified DataFrame:
print (df)
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
4 A 4 43
5 A 3 46
If the same value forms more than one consecutive block (like the two A blocks here), the general solution is needed:
mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
4 A 4 43
5 3 46
The simple duplicated mask, by contrast, also blanks the first row of the second A block:

mask = df['Name'].duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
4 4 43
5 3 46
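If the blank labels are only needed for display, a hedged alternative is to build a MultiIndex from the original, unmodified frame; pandas blanks repeated consecutive index labels when printing, without overwriting any data:

print(df.set_index(['Name', 'term']))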
I have a query in Stream Analytics to count the requests per hour. It has a GROUP BY clause, and I got the message "a window is expected", so I added a tumblingwindow, but I do not understand how it works.
group by
datetimefromparts(year(r.starttime), month(r.starttime), day(r.starttime), datepart(hh, r.starttime),0,0,0),
r.tags,
r.IsMobile,
tumblingwindow(hh, 1)
When I choose hh, I get these 11 results:
requestdate tags ismobile count summary
2016-10-03T21:00:00.000Z A 0 6 2016-10-03T21:00:00.000ZA06
2016-10-03T21:00:00.000Z B 0 1 2016-10-03T21:00:00.000ZB01
2016-10-03T22:00:00.000Z A 0 20 2016-10-03T22:00:00.000ZA020
2016-10-03T21:00:00.000Z B 0 1 2016-10-03T21:00:00.000ZB01
2016-10-03T22:00:00.000Z B 0 14 2016-10-03T22:00:00.000ZB014
2016-10-03T21:00:00.000Z A 1 2 2016-10-03T21:00:00.000ZA12
2016-10-03T21:00:00.000Z B 1 1 2016-10-03T21:00:00.000ZB11
2016-10-03T21:00:00.000Z B 1 1 2016-10-03T21:00:00.000ZB11
2016-10-03T21:00:00.000Z A 1 1 2016-10-03T21:00:00.000ZA11
2016-10-03T22:00:00.000Z A 1 15 2016-10-03T22:00:00.000ZA115
2016-10-03T22:00:00.000Z B 1 22 2016-10-03T22:00:00.000ZB122
But when I choose dd (tumblingwindow(dd, 1)), I get these results:
requestdate tags ismobile count summary
2016-10-16T21:00:00.0000000Z B 1 45 2016-10-16T21:00:00.0000000ZB145
2016-10-16T22:00:00.0000000Z A 0 51 2016-10-16T22:00:00.0000000ZA051
2016-10-16T22:00:00.0000000Z A 1 49 2016-10-16T22:00:00.0000000ZA149
2016-10-16T22:00:00.0000000Z B 0 41 2016-10-16T22:00:00.0000000ZB041
2016-10-16T22:00:00.0000000Z B 1 39 2016-10-16T22:00:00.0000000ZB139
2016-10-16T23:00:00.0000000Z A 0 3 2016-10-16T23:00:00.0000000ZA03
2016-10-16T23:00:00.0000000Z A 0 39 2016-10-16T23:00:00.0000000ZA039
2016-10-16T23:00:00.0000000Z A 1 2 2016-10-16T23:00:00.0000000ZA12
2016-10-16T23:00:00.0000000Z A 1 38 2016-10-16T23:00:00.0000000ZA138
2016-10-16T23:00:00.0000000Z B 0 1 2016-10-16T23:00:00.0000000ZB01
2016-10-16T23:00:00.0000000Z B 0 46 2016-10-16T23:00:00.0000000ZB046
2016-10-16T23:00:00.0000000Z B 1 29 2016-10-16T23:00:00.0000000ZB129
2016-10-16T23:00:00.0000000Z B 1 4 2016-10-16T23:00:00.0000000ZB14
2016-10-17T00:00:00.0000000Z A 0 42 2016-10-17T00:00:00.0000000ZA042
2016-10-17T00:00:00.0000000Z A 1 36 2016-10-17T00:00:00.0000000ZA136
2016-10-17T00:00:00.0000000Z B 0 39 2016-10-17T00:00:00.0000000ZB039
2016-10-17T00:00:00.0000000Z B 1 45 2016-10-17T00:00:00.0000000ZB145
2016-10-17T01:00:00.0000000Z A 0 41 2016-10-17T01:00:00.0000000ZA041
When I run my job for several days, I get 8 rows for 23:00:00h every day, but I am expecting only 4 rows per hour. How can I solve that? Can someone explain to me how this works and how to solve this problem?
I believe it's due to your Tags always having both A and B. This looks like expected behavior, since every hour you have an A and a B. You will get the same thing with IsMobile when there is more than one distinct value in the hourly group.
So, based on Tags and IsMobile both having two possible choices every hour, you could have up to four rows per hour if there were the following events in that hour:
Tag: A, IsMobile: 0
Tag: A, IsMobile: 1
Tag: B, IsMobile: 0
Tag: B, IsMobile: 1