pandas groupby: trying to optimise several steps

I've been trying to optimise a Bokeh server that calculates live Covid-19 stats by selected country.
I found myself repeating a groupby call to calculate new columns, and I was wondering, having created the groupby, whether I could then apply it to multiple columns in a way similar to .agg().
For example:
dfall = pd.DataFrame(db("SELECT * FROM C19daily"))
dfall.set_index(['geoId', 'date'], drop=False, inplace=True)
dfall = dfall.sort_index(ascending=True)
dfall.head()
                     id        date geoId  cases  deaths          auid
geoId date
AD    2020-03-03  70119  2020-03-03    AD      1       0  AD03/03/2020
      2020-03-14  70118  2020-03-14    AD      1       0  AD14/03/2020
      2020-03-16  70117  2020-03-16    AD      3       0  AD16/03/2020
      2020-03-17  70116  2020-03-17    AD      9       0  AD17/03/2020
      2020-03-18  70115  2020-03-18    AD      0       0  AD18/03/2020
I need to create new columns based on 'cases' and 'deaths', applying various functions like cumsum(). Currently I do this the long way:
dfall['ccases'] = dfall.groupby(level=0)['cases'].cumsum()
dfall['dpc_cases'] = dfall.groupby(level=0)['cases'].pct_change(fill_method='pad', periods=7)
.....
dfall['cdeaths'] = dfall.groupby(level=0)['deaths'].cumsum()
dfall['dpc_deaths'] = dfall.groupby(level=0)['deaths'].pct_change(fill_method='pad', periods=7)
I tried to optimise the groupby call like this:
with dfall.groupby(level=0) as gr:
    gr = gr['cases'].cumsum()...
But the error suggests the class doesn't support this:
AttributeError: __enter__
I thought I could use .agg({}) and supply a dictionary:
g = dfall.groupby(level=0).agg({'cc' : 'cumsum', 'cd' : 'cumsum'})
but that produces another error:
pandas.core.base.SpecificationError: nested renamer is not supported
I have plenty of other bits to optimise; I thought this Python part would be the easiest and would save a few ms!
Could anyone nudge me in the right direction?

To avoid repeating dfall.groupby(level=0) you can just save it in a variable:
gb = dfall.groupby(level=0)
gb_cases = gb['cases']
dfall['ccases'] = gb_cases.cumsum()
dfall['dpc_cases'] = gb_cases.pct_change(fill_method='pad', periods=7)
...
And to run multiple aggregations in a single expression, I think you can use named aggregation (see the sketch below). But I have no clue whether it will be more performant or not. Either way, it's better to profile the code and improve the actual bottlenecks.
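For instance, a minimal sketch, assuming pandas >= 0.25 (which introduced named aggregation). Note that cumsum() and pct_change() are transforms rather than reductions, so for those it is selecting several columns from the saved groupby that removes the repetition:

gb = dfall.groupby(level=0)

# Transforms can be applied to several columns at once; the result is a
# DataFrame with one output column per input column.
cum = gb[['cases', 'deaths']].cumsum()
dfall['ccases'] = cum['cases']
dfall['cdeaths'] = cum['deaths']

pc = gb[['cases', 'deaths']].pct_change(fill_method='pad', periods=7)
dfall['dpc_cases'] = pc['cases']
dfall['dpc_deaths'] = pc['deaths']

# Named aggregation works for true reductions, e.g. per-country totals:
totals = gb.agg(total_cases=('cases', 'sum'), total_deaths=('deaths', 'sum'))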

Related

How to handle errors with TimeDelta and Integers in Python

I need to calculate the distance between two dates.
df3['dist_2_1'] = (df3['Date2'] - df3['Date1'])
When I save this into my SQLite DB the format is terrible, so I decided to use an integer format which is much better.
df3['dist_2_1'] = (df3['Date2'] - df3['Date1']).astype('timedelta64[D]').astype(int)
So far so good, but in a similar case I have NULL values, which cause an error when I try to compute the difference between dates.
df3['dist_B_3'] = df3['Break_date'] - df3['Date3']
The Break_date can be null, so in that case I want the final result in dist_B_3 to be 0, but right now an error breaks everything. I tested this so far, but it doesn't work...
try:
    if df3['Break_date'] == 'NaT':
        df3['dist_B_3'] = 0
    else:
        df3['dist_B_3'] = df3['Break_date'] - df3['Date3']
        #().astype('timedelta64[D]').astype(int)
except Exception:
    print("error in the dist_B_3")
My df3['Break_date'] column is this one, so the NaT values are the ones creating the error:
0 2022-07-13
1 2022-07-12
2 2022-07-14
3 2022-07-14
4 NaT
5 NaT
Any idea on how to handle this?
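One way to handle this, as a minimal sketch (assuming Break_date and Date3 are already datetime columns): do the subtraction first, then replace the NaT results with a zero timedelta before converting to integer days.

import pandas as pd

# Rows where Break_date is NaT produce NaT differences; filling them with a
# zero timedelta maps those rows to 0 before the integer conversion.
df3['dist_B_3'] = (
    (df3['Break_date'] - df3['Date3'])
    .fillna(pd.Timedelta(0))
    .dt.days
    .astype(int)
)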

What is the simplest way to complete a function on every row of a large table?

so I want to do a Fisher exact test (one-sided) on every row of a 3000+ row table with a format matching the example below:
gene    sample_alt  sample_ref  population_alt  population_ref
One     4           556         770             37000
Two     5           555         771             36999
Three   6           554         772             36998
I would ideally like to make another column of the table equivalent to
[(4+556)!(4+770)!(770+37000)!(556+37000)!]/[4!(556!)770!(37000!)(4+556+770+37000)!]
for the first row of data, and so on and so forth for each row of the table.
I know how to do a Fisher test in R for simple 2x2 tables, but I wouldn't know how to apply the fisher.test() function to each row of a large table. I also can't use an Excel formula, because the numbers get so big with the factorials that they exceed Excel's digit limit and produce a #NUM error. What's the best way to do this? Thanks in advance!
Beginning with a tab-delimited text file on the desktop (table.txt) with the same format as shown in the question above:
if(!require(psych)){install.packages("psych")}

multiFisher = function(file="Desktop/table.txt", saveit=TRUE,
                       outfile="Desktop/table.csv", progress=T,
                       verbose=FALSE, digits=3, ... )
{
  require(psych)

  Data = read.table(file, skip=1, header=F,
                    col.names=c("Gene", "MD", "WTD", "MC", "WTC"), ...)

  if(verbose){print(str(Data))}

  Data$Fisher.p = NA
  Data$phi = NA
  Data$OR1 = format(0.123, nsmall=3)
  Data$OR2 = NA

  if(progress){cat("\n")}

  for(i in 1:length(Data$Gene)){
    Matrix = matrix(c(Data$WTC[i], Data$MC[i], Data$WTD[i], Data$MD[i]), nrow=2)
    Fisher = fisher.test(Matrix, alternative = 'greater')
    Data$Fisher.p[i] = signif(Fisher$p.value, digits=digits)
    Data$phi[i] = phi(Matrix, digits=digits)
    OR1 = (Data$WTC[i]*Data$MD[i])/(Data$MC[i]*Data$WTD[i])
    OR2 = 1 / OR1
    Data$OR1[i] = format(signif(OR1, digits=digits), nsmall=3)
    Data$OR2[i] = signif(OR2, digits=digits)
    if(progress){cat(".")}
  }

  if(progress){cat("\n"); cat("\n")}
  if(saveit){write.csv(Data, outfile)}

  return(Data)
}

multiFisher()
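For comparison, since the rest of this page uses Python: a minimal pandas/SciPy sketch of the same row-wise one-sided test (the column names mirror the table above and are otherwise an assumption; scipy.stats.fisher_exact works on the counts directly, which avoids the factorial-overflow problem mentioned in the question):

import pandas as pd
from scipy.stats import fisher_exact

df = pd.DataFrame({
    'gene': ['One', 'Two', 'Three'],
    'sample_alt': [4, 5, 6],
    'sample_ref': [556, 555, 554],
    'population_alt': [770, 771, 772],
    'population_ref': [37000, 36999, 36998],
})

def row_fisher(row):
    # Build the 2x2 contingency table for one gene and run a one-sided test;
    # fisher_exact returns (odds ratio, p-value).
    table = [[row['sample_alt'], row['sample_ref']],
             [row['population_alt'], row['population_ref']]]
    return fisher_exact(table, alternative='greater')[1]

df['fisher_p'] = df.apply(row_fisher, axis=1)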

Failing to use sumproduct on date ranges with multiple conditions [Python]

From the replacement data table (below in the image), I am trying to incorporate the solbox product replacements into the time series data format (above in the image). I need to extract the number of consumers per day from this information.
What I need to find out:
On a specific date, how many solbox products were active
On a specific date, how many solbox products belonging to consumers were active
I have used this line of code in Excel but cannot implement it properly in Python:
=SUMPRODUCT((Record_Solbox_Replacement!$O$2:$O$1367 = "consumer") * (A475>=Record_Solbox_Replacement!$L$2:$L$1367)*(A475<Record_Solbox_Replacement!$M$2:$M$1367))
I tried this in Python:
timebase_df['date'] = pd.date_range(start = replace_table_df['solbox_started'].min(), end = replace_table_df['solbox_started'].max(), freq = frequency)
timebase_df['date_unix'] = timebase_df['date'].astype(np.int64) // 10**9
timebase_df['no_of_solboxes'] = ((timebase_df['date_unix']>=replace_table_df['started'].to_numpy()) & (timebase_df['date_unix'] < replace_table_df['ended'].to_numpy() & replace_table_df['customer_type'] == 'customer']))
ERROR:
~\Anaconda3\Anaconda4\lib\site-packages\pandas\core\ops\array_ops.py in comparison_op(left, right, op)
232 # The ambiguous case is object-dtype. See GH#27803
233 if len(lvalues) != len(rvalues):
--> 234 raise ValueError("Lengths must match to compare")
235
236 if should_extension_dispatch(lvalues, rvalues):
ValueError: Lengths must match to compare
Can someone help me please? I can explain in comment section if I have missed something.
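A sketch of one way to get the SUMPRODUCT equivalent, assuming replace_table_df has 'started'/'ended' columns in unix seconds and a 'customer_type' column as in the question. The ValueError arises because the element-wise comparison pits arrays of different lengths against each other; broadcasting each date against every replacement interval avoids that:

import numpy as np

# Shape the arrays so each date (rows) is compared against every
# replacement interval (columns).
dates = timebase_df['date_unix'].to_numpy()[:, None]
started = replace_table_df['started'].to_numpy()[None, :]
ended = replace_table_df['ended'].to_numpy()[None, :]
is_consumer = (replace_table_df['customer_type'] == 'consumer').to_numpy()[None, :]

# For each date, count the intervals that are active, and separately the
# active intervals that belong to consumers (as in the Excel formula).
active = (dates >= started) & (dates < ended)
timebase_df['no_of_solboxes'] = active.sum(axis=1)
timebase_df['no_of_consumers'] = (active & is_consumer).sum(axis=1)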

Replace values in observations (i.e., multiple columns within multiple rows) based on multiple conditionals

I am trying to replace the values of 3 columns within multiple observations based on two conditionals (e.g., a specific ID after a particular date).
I have seen similar questions.
Pandas Multiple Conditions Function based on Column
Pandas replace, multi column criteria
Pandas: How do I assign values based on multiple conditions for existing columns?
Replacing values in a pandas dataframe based on multiple conditions
However, they did not quite address my problem, or I couldn't quite adapt them to solve it.
This code will generate a dataframe similar to mine:
df = pd.DataFrame({
    'SUR_ID': {0: 'SUR1', 1: 'SUR1', 2: 'SUR1', 3: 'SUR1', 4: 'SUR2', 5: 'SUR2'},
    'DATE': {0: '05-01-2019', 1: '05-11-2019', 2: '06-15-2019', 3: '06-20-2019', 4: '05-15-2019', 5: '06-20-2019'},
    'ACTIVE_DATE': {0: '05-01-2019', 1: '05-01-2019', 2: '05-01-2019', 3: '05-01-2019', 4: '05-01-2019', 5: '05-01-2019'},
    'UTM_X': {0: '444895', 1: '444895', 2: '444895', 3: '444895', 4: '445050', 5: '445050'},
    'UTM_Y': {0: '4077528', 1: '4077528', 2: '4077528', 3: '4077528', 4: '4077762', 5: '4077762'},
})
Output Dataframe:
What I am trying to do:
I am trying to replace UTM_X, UTM_Y, and ACTIVE_DATE with
[444917, 4077830, '06-04-2019']
when
SUR_ID is "SUR1" and DATE >= "2019-06-04 12:00:00"
This is a poorly adapted version of the solution to the first question above, in an attempt to fix my problem; it throws an error:
df.loc[[df['SUR_ID'] == 'SUR1' and df['DATE'] >='2019-06-04 12:00:00'], ['UTM_X', 'UTM_Y', 'Active_Date']] = [444917, 4077830, '06-04-2019']
First ensure that the column DATE is of type datetime; then, when using two conditions, each needs to be wrapped in parentheses individually, so you can do:
df.DATE = pd.to_datetime(df.DATE)
df.loc[ (df['SUR_ID'] == 'SUR1') & (df['DATE'] >= pd.to_datetime('2019-06-04 12:00:00')),
['UTM_X', 'UTM_Y', 'ACTIVE_DATE']] = [444917, 4077830, '06-04-2019']
See the difference between what you wrote for the boolean mask:
[df['SUR_ID'] == 'SUR1' and df['DATE'] >='2019-06-04 12:00:00']
and what is here with parentheses:
(df['SUR_ID'] == 'SUR1') & (df['DATE'] >= pd.to_datetime('2019-06-04 12:00:00'))
With and, Python tries to collapse each Series to a single boolean, which is ambiguous and fails; & performs the comparison element-wise.
Use:
df['UTM_X']=df['UTM_X'].mask(df['SUR_ID'].eq('SUR1') & (pd.to_datetime(df['DATE'])>= pd.to_datetime("2019-06-04 12:00:00")),444917)
df['UTM_Y']=df['UTM_Y'].mask(df['SUR_ID'].eq('SUR1') & (pd.to_datetime(df['DATE'])>= pd.to_datetime("2019-06-04 12:00:00")),4077830)
df['ACTIVE_DATE']=df['ACTIVE_DATE'].mask(df['SUR_ID'].eq('SUR1') & (pd.to_datetime(df['DATE'])>= pd.to_datetime("2019-06-04 12:00:00")),'06-04-2019')
Output:
SUR_ID DATE ACTIVE_DATE UTM_X UTM_Y
0 SUR1 05-01-2019 05-01-2019 444895 4077528
1 SUR1 05-11-2019 05-01-2019 444895 4077528
2 SUR1 06-15-2019 06-04-2019 444917 4077830
3 SUR1 06-20-2019 06-04-2019 444917 4077830
4 SUR2 05-15-2019 05-01-2019 445050 4077762
5 SUR2 06-20-2019 05-01-2019 445050 4077762
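As a small variation on the same idea (a sketch, not part of the original answers), the mask can be computed once and reused for all three columns, which also avoids parsing the dates three times:

mask = df['SUR_ID'].eq('SUR1') & (pd.to_datetime(df['DATE']) >= pd.to_datetime('2019-06-04 12:00:00'))
df.loc[mask, ['UTM_X', 'UTM_Y', 'ACTIVE_DATE']] = [444917, 4077830, '06-04-2019']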

how to find exponential weighted moving average using dataframe.ewma?

Previously I used the following to calculate the ewma
dataset['26ema'] = pd.ewma(dataset['price'], span=26)
But in the latest version of pandas, pd.ewma has been removed. How do I calculate it using the new method, dataframe.ewma?
dataset['26ema'] = dataset['price'].ewma(span=26)
This gives an error: AttributeError: 'Series' object has no attribute 'ewma'
Use Series.ewm:
dataset['price'].ewm(span=26)
See GH11603 for the relevant PR and mapping of the old API to new ones.
Minimal Code Example
s = pd.Series(range(5))
s.ewm(span=3).mean()
0 0.000000
1 0.666667
2 1.428571
3 2.266667
4 3.161290
dtype: float64
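Applied to the original example (note that .ewm(span=26) alone only builds the window object; the trailing .mean() is what actually computes the EWMA):

dataset['26ema'] = dataset['price'].ewm(span=26).mean()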
