python: groupby - function called when grouping - python-3.x

I would like to do a groupby and call a function at the same time.
Here is the function:
def dollar_wtd_two(DF, kw_param, kw_param2):
    return np.sum(DF[kw_param] * DF[kw_param2]) / DF[kw_param2].sum()
The function dollar_wtd_two takes 3 parameters: DF (the dataframe) and two column names from that same dataframe. Conceptually, here is what I would like to do:
DF.groupby(['prime_broker_id', 'country_name'],
as_index=False).agg({"notional_current": np.sum, "new_column":dollar_wtd_two(DF,
kw_param, kw_param2) })
Basically the groupby would do simple operations like sum or average and also more involved operations where I would call functions similar to dollar_wtd_two
Here is what the output of DF looks like without the "new_column":
DF.groupby(['prime_broker_id', 'country_name'],
as_index=False).agg({"notional_current": np.sum })
Output 1:
prime_broker_id country_name notional_current
0 BARCAP AUSTRIA 2.616735e+07
1 BARCAP BELGIUM 6.327196e+07
2 BARCAP DENMARK 1.286309e+07
3 BARCAP FINLAND 4.181843e+07
4 BARCAP FRANCE 1.579292e+08
5 BARCAP GERMANY 2.653451e+08
6 BARCAP IRELAND 1.037968e+07
I am not able to show the output of DF with "new_column":dollar_wtd_two(DF, kw_param, kw_param2). However individually the output of dollar_wtd_two(DF, kw_param, kw_param2) should look like this:
Output 2:
prime_broker_id country_name
BARCAP AUSTRIA 25.009402
BELGIUM 25.083404
DENMARK 25.000000
FINLAND 25.034493
FRANCE 25.000000
GERMANY 25.007943
IRELAND 25.000000
ISRAEL 399.242997
The idea is to combine Output 1 and Output 2 in one operation. Please let me know if it is unclear.
Any help is more than welcome
Thanks a lot
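One possible approach, sketched under assumptions: since dollar_wtd_two needs two columns at once, it can be called inside groupby(...).apply, which passes each group as a DataFrame, and the simple sum can be computed in the same helper. The `spread` column, the sample values, and the helper name `summarise` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def dollar_wtd_two(DF, kw_param, kw_param2):
    # Weighted average of kw_param, using kw_param2 as the weights.
    return np.sum(DF[kw_param] * DF[kw_param2]) / DF[kw_param2].sum()

# Hypothetical sample data; "spread" is an assumed column name.
DF = pd.DataFrame({
    "prime_broker_id": ["BARCAP", "BARCAP", "BARCAP", "GS"],
    "country_name": ["AUSTRIA", "AUSTRIA", "BELGIUM", "FRANCE"],
    "notional_current": [100.0, 300.0, 200.0, 50.0],
    "spread": [20.0, 30.0, 25.0, 10.0],
})

def summarise(group):
    # Each call receives one group as a DataFrame, so several columns
    # can be combined in a single aggregation.
    return pd.Series({
        "notional_current": group["notional_current"].sum(),
        "new_column": dollar_wtd_two(group, "spread", "notional_current"),
    })

keys = ["prime_broker_id", "country_name"]
# Select only the value columns before apply so the helper never
# depends on the grouping columns being present in the group.
result = (
    DF.groupby(keys)[["notional_current", "spread"]]
      .apply(summarise)
      .reset_index()
)
print(result)
```

apply is generally slower than the built-in aggregations, so for large frames it may be preferable to precompute the product column and aggregate it with plain sums.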

Related

How to list row headers from matrix based on binary value (Excel)?

I would like to extract/list from a matrix the row headers based on binary values and depending on the column. Basically FROM something like this:
Country Product1 Product2 Product3
Germany 1 0 1
France 1 1 0
Spain 0 1 0
Italy 1 0 1
Belgium 0 1 0
OBTAIN something like this:
Product1 Product2 Product3
Germany France Germany
France Spain Italy
Italy Belgium
So basically list the values based on column and binary value.
Better if no VBA is involved.
Any suggestion is welcome!
Assuming your data is in a table named Table1, for Office 365:
=T(SORT(IF(Table1[Product1],Table1[[Country]:[Country]])))
and use the fill handle to drag right.

Extract the mapping dictionary between two columns in pandas

I have a dataframe as shown below.
df:
id player country_code country
1 messi arg argentina
2 neymar bra brazil
3 tevez arg argentina
4 aguero arg argentina
5 rivaldo bra brazil
6 owen eng england
7 lampard eng england
8 gerrard eng england
9 ronaldo bra brazil
10 marria arg argentina
from the above df, I would like to extract the mapping dictionary that relates the country_code with country columns.
Expected Output:
d = {'arg':'argentina', 'bra':'brazil', 'eng':'england'}
Dictionary keys are unique, so you can set country_code as the index (duplicate index values are fine) and convert the country Series to a dict:
d = df.set_index('country_code')['country'].to_dict()
If a country_code could map to more than one country, the last value for each country_code is kept.
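As a small illustration of which value is kept, here is a sketch with a deliberately inconsistent row (the 'ARGENTINA' spelling is made up for the demonstration). It also shows drop_duplicates as one way to keep the first value instead of the last.

```python
import pandas as pd

# Trimmed-down sample; the third row is intentionally inconsistent.
df = pd.DataFrame({
    "player": ["messi", "neymar", "tevez", "owen"],
    "country_code": ["arg", "bra", "arg", "eng"],
    "country": ["argentina", "brazil", "ARGENTINA", "england"],
})

# Later rows overwrite earlier ones when the index has duplicates.
last_wins = df.set_index("country_code")["country"].to_dict()

# Dropping duplicate codes first keeps the first occurrence instead.
first_wins = (
    df.drop_duplicates("country_code")
      .set_index("country_code")["country"]
      .to_dict()
)

print(last_wins)
print(first_wins)
```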

How to access a column of grouped data to perform linear regression in pandas?

I want to perform a linear regression on the groups of a grouped data frame in pandas. The function I am calling throws a KeyError that I cannot resolve.
I have an environmental data set called dat that includes concentration data of a chemical in different tree species of various age classes in different country sites over the course of several time steps. I now want to do a regression of concentration over time steps within each group of (site, species, age).
This is my code:
```
import pandas as pd
import statsmodels.api as sm
dat = pd.read_csv('data.csv')
dat.head(15)
SampleName Concentration Site Species Age Time_steps
0 batch1 2.18 Germany pine 1 1
1 batch2 5.19 Germany pine 1 2
2 batch3 11.52 Germany pine 1 3
3 batch4 16.64 Norway spruce 0 1
4 batch5 25.30 Norway spruce 0 2
5 batch6 31.20 Norway spruce 0 3
6 batch7 12.63 Norway spruce 1 1
7 batch8 18.70 Norway spruce 1 2
8 batch9 43.91 Norway spruce 1 3
9 batch10 9.41 Sweden birch 0 1
10 batch11 11.10 Sweden birch 0 2
11 batch12 15.73 Sweden birch 0 3
12 batch13 16.87 Switzerland beech 0 1
13 batch14 22.64 Switzerland beech 0 2
14 batch15 29.75 Switzerland beech 0 3
def ols_res_grouped(group):
    xcols_const = sm.add_constant(group['Time_steps'])
    linmod = sm.OLS(group['Concentration'], xcols_const).fit()
    return linmod.params[1]
grouped = dat.groupby(['Site','Species','Age']).agg(ols_res_grouped)
```
I want to get the regression coefficient of concentration data over Time_steps but get a KeyError: 'Time_steps'. How can the sm method access group["Time_steps"]?
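According to the pandas documentation, agg applies functions to each column independently. Your function therefore receives a single column as a Series rather than the whole group, which is why group['Time_steps'] raises the KeyError.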
It might be possible to use NamedAgg but I am not sure.
I think it is a lot easier to just use a for loop for this:
for _, group in dat.groupby(['Site','Species','Age']):
    coeff = ols_res_grouped(group)
    # if you want to put the coeff inside the dataframe
    dat.loc[group.index, 'coeff'] = coeff
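An alternative to the loop, offered as a sketch: groupby(...).apply passes each group as a DataFrame, so both columns are available inside the function. To keep the example free of extra dependencies it swaps statsmodels OLS for numpy.polyfit with degree 1, which gives the same least-squares slope for a single predictor. The function name `slope` and the trimmed sample data are assumptions.

```python
import numpy as np
import pandas as pd

# First six rows of the sample data from the question.
dat = pd.DataFrame({
    "Concentration": [2.18, 5.19, 11.52, 16.64, 25.30, 31.20],
    "Site": ["Germany"] * 3 + ["Norway"] * 3,
    "Species": ["pine"] * 3 + ["spruce"] * 3,
    "Age": [1, 1, 1, 0, 0, 0],
    "Time_steps": [1, 2, 3, 1, 2, 3],
})

def slope(group):
    # Degree-1 least-squares fit; the first coefficient is the slope.
    coeffs = np.polyfit(
        group["Time_steps"].to_numpy(dtype=float),
        group["Concentration"].to_numpy(dtype=float),
        deg=1,
    )
    return coeffs[0]

keys = ["Site", "Species", "Age"]
# apply (unlike agg) hands the function the whole group.
slopes = (
    dat.groupby(keys)[["Time_steps", "Concentration"]]
       .apply(slope)
       .rename("coeff")
       .reset_index()
)
print(slopes)
```

This returns one row per (Site, Species, Age) group. If the coefficient is needed on every original row, the result can be merged back onto dat using the same keys.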

python -count elements pandas dataframe

I have a table with some info about districts. I have converted it into a pandas dataframe and my question is how can I count how many times SOUTHERN, BAYVIEW etc. appear in the table below? I want to add an extra column next to District with the total number of each district.
District
0 SOUTHERN
1 BAYVIEW
2 CENTRAL
3 NORTH
Here you can use groupby with the size method (other aggregations such as count also work).
With this dataframe:
import pandas as pd
df = pd.DataFrame({'DISTRICT': ['SOUTHERN', 'SOUTHERN', 'BAYVIEW', 'BAYVIEW', 'BAYVIEW', 'CENTRAL', 'NORTH']})
Represented as below
DISTRICT
0 SOUTHERN
1 SOUTHERN
2 BAYVIEW
3 BAYVIEW
4 BAYVIEW
5 CENTRAL
6 NORTH
You can use
df.groupby(['DISTRICT']).size().reset_index(name='counts')
You have this output
DISTRICT counts
0 BAYVIEW 3
1 CENTRAL 1
2 NORTH 1
3 SOUTHERN 2
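Since the question asks for an extra column next to the district on every row, here is a hedged sketch of one way to do that with groupby(...).transform, which returns a result aligned with the original rows. The column name `counts` is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({'DISTRICT': ['SOUTHERN', 'SOUTHERN', 'BAYVIEW', 'BAYVIEW',
                                'BAYVIEW', 'CENTRAL', 'NORTH']})

# transform('count') repeats each group's non-null count on every row
# of that group, so the original row order is preserved.
df['counts'] = df.groupby('DISTRICT')['DISTRICT'].transform('count')
print(df)
```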

Produce output from empty window

Is it possible to produce output from a stream analytics query, using the "group by window" expression, when the window is empty?
For instance, in this example, the query:
SELECT System.Timestamp as WindowEnd, SwitchNum, COUNT(*) as CallCount
FROM CallStream TIMESTAMP BY CallRecTime
GROUP BY TUMBLINGWINDOW(s, 5), SwitchNum
produces the output:
2015-04-15T22:10:40.000Z UK 1
2015-04-15T22:10:40.000Z US 1
2015-04-15T22:10:45.000Z China 1
2015-04-15T22:10:45.000Z Germany 1
2015-04-15T22:10:45.000Z UK 3
2015-04-15T22:10:45.000Z US 1
2015-04-15T22:10:50.000Z Australia 2
...
Is it possible to make it produce something like:
2015-04-15T22:10:40.000Z China 0
2015-04-15T22:10:40.000Z Germany 0
2015-04-15T22:10:40.000Z UK 1
2015-04-15T22:10:40.000Z US 1
2015-04-15T22:10:40.000Z Australia 0
2015-04-15T22:10:45.000Z China 1
2015-04-15T22:10:45.000Z Germany 1
2015-04-15T22:10:45.000Z UK 3
2015-04-15T22:10:45.000Z US 1
2015-04-15T22:10:45.000Z Australia 0
...
?
The objective is to detect, using a hopping window, if there were no events in the last x seconds.
Use a LEFT JOIN with a lookup table of SwitchNum values, which will produce a result with NULL if no values are in the window.
This blog post explains in more detail: http://blogs.msdn.com/b/streamanalytics/archive/2014/12/09/how-to-query-for-all-events-and-no-event-scenarios.aspx
