I am applying the quantile function on the duration column of my data frame:
a=df.groupby('version')[['duration']].quantile([.25, .5, .75])
a
duration
version
4229 0.25 1451.00
0.50 1451.00
0.75 1451.00
6065 0.25 213.75
0.50 426.50
0.75 639.25
9209 0.25 386.50
0.50 861.00
0.75 866.00
2304 0.25 664.50
0.50 669.00
0.75 736.50
6389 0.25 1.00
0.50 797.00
0.75 832.00
I am wondering how I can reshape/re-pivot the above data frame so that the new data frame (yes, it has to stay a data frame) looks like this:
version duration_Q1 duration_Q2 duration_Q3
4229 1451.00 1451.00 1451.00
6065 213.75 426.50 639.25
9209 386.50 861.00 866.00
2304 664.50 669.00 736.50
6389 1.00 797.00 832.00
Thanks!
You could use unstack, followed by some renaming operations:
import pandas as pd

a = pd.DataFrame({'duration': {(2304, 0.25): 1565.6861959516361,
                               (2304, 0.5): 446.4769649280514,
                               (2304, 0.75): 701.8254115357969,
                               (4229, 0.25): 1868.982390749203,
                               (4229, 0.5): 242.36201172579996,
                               (4229, 0.75): 789.482292226787,
                               (6065, 0.25): 1421.9585894685038,
                               (6065, 0.5): 357.04491735326343,
                               (6065, 0.75): 169.78973203074895,
                               (6389, 0.25): 1789.1550141153925,
                               (6389, 0.5): 516.9365429825862,
                               (6389, 0.75): 1830.6493228794639,
                               (9209, 0.25): 1129.853279993191,
                               (9209, 0.5): 1759.1258334115485,
                               (9209, 0.75): 1499.0498929925702}})
pvt = a.unstack()
pvt.columns = pvt.columns.droplevel(0)
pvt.rename(columns={0.25:'duration_Q1',0.5:'duration_Q2',0.75:'duration_Q3'},inplace=True)
duration_Q1 duration_Q2 duration_Q3
version
2304 1565.686196 446.476965 701.825412
4229 1868.982391 242.362012 789.482292
6065 1421.958589 357.044917 169.789732
6389 1789.155014 516.936543 1830.649323
9209 1129.853280 1759.125833 1499.049893
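Applied directly to the groupby output from the question, the same unstack-and-rename steps give the exact column names asked for. A minimal sketch, assuming df is the original frame from the question; the final reset_index turns version back into a regular column:

a = df.groupby('version')[['duration']].quantile([.25, .5, .75])

pvt = a.unstack()                       # quantile level (0.25/0.5/0.75) moves into the columns
pvt.columns = pvt.columns.droplevel(0)  # drop the 'duration' column level
pvt = pvt.rename(columns={0.25: 'duration_Q1', 0.5: 'duration_Q2', 0.75: 'duration_Q3'})
pvt = pvt.reset_index()                 # make 'version' an ordinary column again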
The question is fairly straightforward. The content here is information collected during a run of an interconnected system. I have simplified the functions into three dependent functions, objectively similar to the real code, to emphasise the bottlenecks reported by line_profiler.
As of now it seems all lines are mandatory, but I refuse to believe this. I did try simplifying the logical parts with .all() and .any(), but either I am not proficient enough in Python or I keep running into a bizarre limitation; it just keeps giving me a different result than expected.
If you are unable to gather the full use case of the 3 functions, try to read them in a purely syntactic manner (while at least preserving the same number of variables) and provide a usable solution that achieves the same functionality.
Total time: 419.632 s
File: switch_array_logic.py
Function: func_A at line 336
Line # Hits Time Per Hit % Time Line Contents
==============================================================
336 @profile
337 def func_A(_switch,_loc,_prev_state=None,_cur_state=None,bulb_on=True):
338 43048749 27529633.0 0.6 6.6 if bulb_on:
339 43048651 43186202.0 1.0 10.3 _p_cur_state = _switch[_loc][2]
340 43048651 42282929.0 1.0 10.1 _p_prev_state = _switch[_loc-1][2]
341 else:
342 98 80.0 0.8 0.0 _p_prev_state = _prev_state
343 98 54.0 0.6 0.0 _p_cur_state = _cur_state
348 43048749 38735332.0 0.9 9.2 cond_11 = (_p_cur_state >= 0)
349 43048749 35887392.0 0.8 8.6 cond_10 = (_p_cur_state < 0)
350 43048749 34747819.0 0.8 8.3 cond_00 = (_p_prev_state < 0)
351 43048749 34685101.0 0.8 8.3 cond_01 = (_p_prev_state > 0)
352
353 43048749 24860540.0 0.6 5.9 cond_x = cond_00 and cond_11
354 43048749 24083554.0 0.6 5.7 cond_y = cond_10 and cond_01
355 43048749 24902602.0 0.6 5.9 condition = cond_x or cond_y
356
357 43048749 23279349.0 0.5 5.5 _brighness = 0
358 43048749 23704315.0 0.6 5.6 if condition:
359 21573867 18852346.0 0.9 4.5 _brighness = _p_prev_state/2
360 43048749 22894531.0 0.5 5.5 return _brighness
Total time: 66.9593 s
File: switch_array_logic.py
Function: func_B at line 362
Line # Hits Time Per Hit % Time Line Contents
==============================================================
362 @profile
363 def func_B(_switch,_pos,_loc,_brighness):
364 43638012 66959275.0 1.5 100.0 return _switch[_pos][3] - _switch[_loc][3] - _brighness
Total time: 9.59384 s
File: switch_array_logic.py
Function: func_C at line 378
Line # Hits Time Per Hit % Time Line Contents
==============================================================
378 @profile
379 def func_C(_switchs_slice,_valid,_PH_LEVEL):
380 114012 82093.0 0.7 0.9 _idx = 0
381 114012 2403035.0 21.1 25.0 _brighness = func_A(_switchs_slice,_idx+1)
382 809223 583712.0 0.7 6.1 for _pos in range(1, _switchs_slice.shape[0]):
383 808216 831149.0 1.0 8.7 if _valid and _switchs_slice[_idx][0] < _switchs_slice[_pos][0]:
385 57307 44704.0 0.8 0.5 return False,_pos
386 750909 1686519.0 2.2 17.6 elif ~_valid and _switchs_slice[_idx][1] > _switchs_slice[_pos][1]:
388 47536 41899.0 0.9 0.4 return False,_pos
389 703373 3320805.0 4.7 34.6 _ph_out = func_B(_switchs_slice,_pos,_idx,_brighness)
391 703373 592597.0 0.8 6.2 if abs(_ph_out) > _PH_LEVEL[1] :
393 8162 6334.0 0.8 0.1 return True, _pos #Successful Case
395 1007 990.0 1.0 0.0 return False,_switchs_slice.shape[0]
Additionally, as I am not familiar with Cython, I would really appreciate it if someone could provide an optimised Cython version in an answer. I am not sure we will use it in the code, but I would very much like to see the performance disparity.
Thanks in advance!
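For what it's worth, here is an illustrative sketch only, not a drop-in replacement: it assumes _switch behaves like a 2-D NumPy array whose column 2 holds the state values (as the profiled code suggests) and covers only the bulb_on=True path, but it shows how the per-call branch logic in func_A could be evaluated for all positions at once:

import numpy as np

def func_A_vectorised(switch):
    # switch is assumed to be a 2-D NumPy array; column 2 holds the state values
    cur = switch[1:, 2]      # what _switch[_loc][2] sees for every _loc >= 1
    prev = switch[:-1, 2]    # what _switch[_loc - 1][2] sees
    # sign change in either direction, matching cond_x / cond_y in func_A
    condition = ((prev < 0) & (cur >= 0)) | ((cur < 0) & (prev > 0))
    # brightness is prev/2 where the condition holds, 0 elsewhere
    return np.where(condition, prev / 2, 0.0)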
I have checked all the SO questions that generate a confusion matrix and calculate TP, TN, FP, FN.
Scikit-learn: How to obtain True Positive, True Negative, False Positive and False Negative
Mainly they use
from sklearn.metrics import confusion_matrix
For two classes it's easy:
from sklearn.metrics import confusion_matrix
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
For multiclass there is one solution, but it does it only for the first class, not all classes:
def perf_measure(y_actual, y_pred):
    class_id = set(y_actual).union(set(y_pred))
    TP = []
    FP = []
    TN = []
    FN = []

    for index, _id in enumerate(class_id):
        TP.append(0)
        FP.append(0)
        TN.append(0)
        FN.append(0)
        for i in range(len(y_pred)):
            if y_actual[i] == y_pred[i] == _id:
                TP[index] += 1
            if y_pred[i] == _id and y_actual[i] != y_pred[i]:
                FP[index] += 1
            if y_actual[i] == y_pred[i] != _id:
                TN[index] += 1
            if y_pred[i] != _id and y_actual[i] != y_pred[i]:
                FN[index] += 1

    return class_id, TP, FP, TN, FN
But this by default calculates it for only one class.
I want to calculate the values for each class, given 4 classes. For the data at https://extendsclass.com/csv-editor.html#0697f61
I have done it using Excel like this.
Then calculated the results for each class.
I have automated it in an Excel sheet, but is there any programmatic solution in Python or sklearn to do this?
This is way easier with multilabel_confusion_matrix. For your example, you can also pass labels=["A", "N", "O", "~"] as an argument to get the labels in the preferred order.
from sklearn.metrics import multilabel_confusion_matrix
import numpy as np
mcm = multilabel_confusion_matrix(y_true, y_pred)
tps = mcm[:, 1, 1]
tns = mcm[:, 0, 0]
recall = tps / (tps + mcm[:, 1, 0]) # Sensitivity
specificity = tns / (tns + mcm[:, 0, 1]) # Specificity
precision = tps / (tps + mcm[:, 0, 1]) # PPV
Which results in an array for each metric:
[[0.83333333 0.94285714 0.64 0.25 ] # Sensitivity / Recall
[0.99029126 0.74509804 0.91666667 1. ] # Specificity
[0.9375 0.83544304 0.66666667 1. ]] # Precision / PPV
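If you want the rows of mcm in a fixed label order (as mentioned above), pass the labels explicitly:

mcm = multilabel_confusion_matrix(y_true, y_pred, labels=["A", "N", "O", "~"])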
Alternatively, you can view class-dependent precision and recall with classification_report. You can get the same values by passing output_dict=True and indexing by each class label (see the sketch after the report below).
>>> print(classification_report(y_true, y_pred))
precision recall f1-score support
A 0.94 0.83 0.88 18
N 0.84 0.94 0.89 70
O 0.67 0.64 0.65 25
~ 1.00 0.25 0.40 8
accuracy 0.82 121
macro avg 0.86 0.67 0.71 121
weighted avg 0.83 0.82 0.81 121
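A minimal sketch of that output_dict route, assuming y_true and y_pred hold the four labels "A", "N", "O" and "~" used above:

from sklearn.metrics import classification_report

labels = ["A", "N", "O", "~"]
report = classification_report(y_true, y_pred, output_dict=True)

# per-class values, in the same order as the labels list
recalls = [report[label]["recall"] for label in labels]
precisions = [report[label]["precision"] for label in labels]
print(recalls)
print(precisions)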
I have successfully run an OLS model using the statsmodels package in Python. However, the model picks one variable as the intercept and does not include it in the interaction results. Specifically, I have 5 levels in the "Meal_Cat" category below, and the model picks one of them (the "Low" level) and treats it as the intercept. That is okay, but the problem is that I am unable to see its interactions with the other categories (such as a Low by Group interaction).
See below for how the model is set up:
model = ols('Cost ~ C(Meal_Cat)*C(Group)*C(Region) + Age + Gender', data= Mealcat_DF).fit()
# Seeing if the overall model is significant
print(f"Overall model F({model.df_model: .0f},{model.df_resid: .0f}) = {model.fvalue: .3f}, p = {model.f_pvalue: .4f}")
model.summary()
I was wondering if anyone can suggest a way to include all terms from the model in the interaction summary.
If your variable is already a string or category variable, you can just try the following.
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
df = sns.load_dataset('tips')
formula = 'tip ~ sex*smoker*day + total_bill'
model = smf.ols(formula, data=df)
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: tip R-squared: 0.485
Model: OLS Adj. R-squared: 0.449
Method: Least Squares F-statistic: 13.35
Date: Mon, 20 Jan 2020 Prob (F-statistic): 8.29e-25
Time: 14:21:24 Log-Likelihood: -344.02
No. Observations: 244 AIC: 722.0
Df Residuals: 227 BIC: 781.5
Df Model: 16
Covariance Type: nonrobust
=========================================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------------------
Intercept 0.9917 0.357 2.777 0.006 0.288 1.695
sex[T.Female] -0.0731 0.506 -0.144 0.885 -1.071 0.925
smoker[T.No] -0.0427 0.398 -0.107 0.915 -0.827 0.741
day[T.Fri] -0.4549 0.487 -0.933 0.352 -1.415 0.506
day[T.Sat] -0.4662 0.381 -1.224 0.222 -1.217 0.284
day[T.Sun] -0.2880 0.423 -0.681 0.497 -1.121 0.545
sex[T.Female]:smoker[T.No] -0.1423 0.593 -0.240 0.811 -1.311 1.026
sex[T.Female]:day[T.Fri] 0.8553 0.737 1.161 0.247 -0.597 2.307
sex[T.Female]:day[T.Sat] 0.2319 0.605 0.383 0.702 -0.960 1.424
sex[T.Female]:day[T.Sun] 1.0867 0.772 1.407 0.161 -0.435 2.608
smoker[T.No]:day[T.Fri] 0.1224 0.905 0.135 0.893 -1.660 1.905
smoker[T.No]:day[T.Sat] 0.6258 0.480 1.303 0.194 -0.320 1.572
smoker[T.No]:day[T.Sun] 0.2552 0.505 0.506 0.614 -0.739 1.250
sex[T.Female]:smoker[T.No]:day[T.Fri] -0.2185 1.303 -0.168 0.867 -2.787 2.350
sex[T.Female]:smoker[T.No]:day[T.Sat] -0.4487 0.759 -0.591 0.555 -1.944 1.046
sex[T.Female]:smoker[T.No]:day[T.Sun] -0.7027 0.892 -0.788 0.431 -2.460 1.054
total_bill 0.1078 0.008 13.951 0.000 0.093 0.123
==============================================================================
Omnibus: 29.744 Durbin-Watson: 2.154
Prob(Omnibus): 0.000 Jarque-Bera (JB): 60.768
Skew: 0.616 Prob(JB): 6.38e-14
Kurtosis: 5.112 Cond. No. 629.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
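If the goal is to see the level that gets absorbed into the intercept, one option is to choose the reference level explicitly with patsy's Treatment coding, so the level you care about appears in the main effects and interaction terms instead. A minimal sketch on the same tips data; the choice of 'Sun' as the reference is purely for illustration:

import seaborn as sns
import statsmodels.formula.api as smf

df = sns.load_dataset('tips')

# Treatment(reference=...) chooses which level is absorbed into the intercept;
# every other level then shows up explicitly in the coefficients and interactions.
formula = "tip ~ C(sex)*C(smoker)*C(day, Treatment(reference='Sun')) + total_bill"
results = smf.ols(formula, data=df).fit()
print(results.summary())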
I am having trouble doing something that seems straightforward to me.
My data is:
ROE_SP500_Q2_2018_quantile.to_json()
'{"index":{"0":0.0,"1":0.05,"2":0.1,"3":0.15,"4":0.2,"5":0.25,"6":0.3,"7":0.35,"8":0.4,"9":0.45,"10":0.5,"11":0.55,"12":0.6,"13":0.65,"14":0.7,"15":0.75,"16":0.8,"17":0.85,"18":0.9,"19":0.95},"ROE_Quantiles":{"0":-0.8931,"1":-0.0393,"2":0.00569,"3":0.03956,"4":0.05826,"5":0.075825,"6":0.09077,"7":0.10551,"8":0.12044,"9":0.14033,"10":0.15355,"11":0.17335,"12":0.1878,"13":0.209175,"14":0.2357,"15":0.27005,"16":0.3045,"17":0.3745,"18":0.46776,"19":0.73119}}'
My code for the plot is:
plt.close()
plt.figure(figsize=(14,8))
sns.barplot(x = 'Quantile', y = 'ROE', data = ROE_SP500_Q2_2018_quantile)
plt.vlines(x = 0.73, ymin = 0, ymax = 0.6, color = 'blue', size = 2)
plt.show()
which returns the following image:
I would like to correct the following problems:
a) the tick labels, which are overly crowded in a strange way I do not understand;
b) the vline, which appears in the wrong place. I am also using the wrong argument to set the thickness of the line and I get an error.
Pass a DataFrame to the data parameter; check seaborn.barplot:
data : DataFrame, array, or list of arrays, optional
Dataset for plotting. If x and y are absent, this is interpreted as wide-form. Otherwise it is expected to be long-form.
sns.barplot(x = 'index', y = 'ROE_Quantiles', data = ROE_SP500_Q2_2018_quantile)
#TypeError: vlines() missing 2 required positional arguments: 'ymin' and 'ymax'
plt.vlines(x = 0.73, ymin = 0, ymax = 0.6, color = 'blue', linewidth=5)
j = '{"index":{"0":0.0,"1":0.05,"2":0.1,"3":0.15,"4":0.2,"5":0.25,"6":0.3,"7":0.35,"8":0.4,"9":0.45,"10":0.5,"11":0.55,"12":0.6,"13":0.65,"14":0.7,"15":0.75,"16":0.8,"17":0.85,"18":0.9,"19":0.95},"ROE_Quantiles":{"0":-0.8931,"1":-0.0393,"2":0.00569,"3":0.03956,"4":0.05826,"5":0.075825,"6":0.09077,"7":0.10551,"8":0.12044,"9":0.14033,"10":0.15355,"11":0.17335,"12":0.1878,"13":0.209175,"14":0.2357,"15":0.27005,"16":0.3045,"17":0.3745,"18":0.46776,"19":0.73119}}'
import ast
df = pd.DataFrame(ast.literal_eval(j))
print (df)
index ROE_Quantiles
0 0.00 -0.893100
1 0.05 -0.039300
10 0.50 0.153550
11 0.55 0.173350
12 0.60 0.187800
13 0.65 0.209175
14 0.70 0.235700
15 0.75 0.270050
16 0.80 0.304500
17 0.85 0.374500
18 0.90 0.467760
19 0.95 0.731190
2 0.10 0.005690
3 0.15 0.039560
4 0.20 0.058260
5 0.25 0.075825
6 0.30 0.090770
7 0.35 0.105510
8 0.40 0.120440
9 0.45 0.140330
plt.close()
plt.figure(figsize=(14,8))
sns.barplot(x = 'index', y = 'ROE_Quantiles', data = df)
plt.vlines(x = 0.73, ymin = 0, ymax = 0.6, color = 'blue', linewidth=5)
plt.show()
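On the vline position: a seaborn barplot places its bars at categorical positions 0, 1, 2, ..., so x = 0.73 lands between the first two bars. A small sketch, assuming df is the frame built above and you want to mark the bar for the 0.75 quantile:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(14, 8))
ax = sns.barplot(x='index', y='ROE_Quantiles', data=df)

# bars sit at integer positions in sorted order of the numeric x values,
# so find the position of the 0.75 quantile and draw the line there
pos = sorted(df['index']).index(0.75)
ax.axvline(x=pos, color='blue', linewidth=5)
plt.show()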
How can I fit the data with f(x) = A*(sin(b*x)/(b*x))**2?
The data.dat file content is:
-3.7 0.020505941
-3.6 0.015109903
-3.5 0.010044806
-3.4 0.005648897
-3.3 0.002285005
-3.2 0.000332768
-3.1 0.000179912
-3 0.002212762
-2.9 0.006806212
-2.8 0.014313401
-2.7 0.025055358
-2.6 0.039310897
-2.5 0.057307025
-2.4 0.079210158
-2.3 0.105118386
-2.2 0.135055049
-2.1 0.168963812
-2 0.206705453
-1.9 0.24805647
-1.8 0.292709632
-1.7 0.340276504
-1.6 0.390291948
-1.5 0.442220555
-1.4 0.495464883
-1.3 0.549375371
-1.2 0.603261707
-1.1 0.65640542
-1 0.708073418
-0.9 0.757532157
-0.8 0.804062127
-0.7 0.846972303
-0.6 0.88561423
-0.5 0.919395388
-0.4 0.947791533
-0.3 0.970357695
-0.2 0.986737575
-0.1 0.996671108
0 1
0.1 0.996671108
0.2 0.986737575
0.3 0.970357695
0.4 0.947791533
0.5 0.919395388
0.6 0.88561423
0.7 0.846972303
0.8 0.804062127
0.9 0.757532157
1 0.708073418
1.1 0.65640542
1.2 0.603261707
1.3 0.549375371
1.4 0.495464883
1.5 0.442220555
1.6 0.390291948
1.7 0.340276504
1.8 0.292709632
1.9 0.24805647
2 0.206705453
2.1 0.168963812
2.2 0.135055049
2.3 0.105118386
2.4 0.079210158
2.5 0.057307025
2.6 0.039310897
2.7 0.025055358
2.8 0.014313401
2.9 0.006806212
3 0.002212762
3.1 0.000179912
3.2 0.000332768
3.3 0.002285005
3.4 0.005648897
3.5 0.010044806
3.6 0.015109903
3.7 0.020505941
3.8 0.025925906
My code for fitting is below:
f(x) = A*(sin(b*x)/(b*x))**2;
A = 1;
b = 1;
fit f(x) "data.dat" u 1:2 via A,b;
plot [x=-3:3] f(x);
I get the error "Undefined value during function evaluation".
I guess that, unlike plot, fit doesn't ignore points for which the function being evaluated produces an undefined value, and your data contains the point x = 0, where sin(b*x)/(b*x) is 0/0. In your particular case, you might reformulate the problem and fit f(x)*x*x to y(x)*x*x in order to remove the "singularity" at zero. For example:
set terminal pngcairo
set output 'fig.png'
f(x) = A*(sin(b*x)/(b*x))**2;
g(x) = A*(sin(b*x)/(b))**2;
fit g(x) 'data.dat' u 1:($2*$1*$1) via A, b;
plot \
g(x)/(x*x) t 'fit', \
'data.dat' w p t 'points'
This produces a plot of the fitted curve together with the data points, written to fig.png.
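For comparison only, the same fit can also be reproduced outside gnuplot. A minimal Python sketch, assuming NumPy and SciPy are available: np.sinc is defined to be 1 at zero, so the 0/0 at x = 0 never arises and no reformulation is needed:

import numpy as np
from scipy.optimize import curve_fit

x, y = np.loadtxt('data.dat', unpack=True)

def f(x, A, b):
    # np.sinc(t) = sin(pi*t)/(pi*t) and equals 1 at t = 0,
    # so sin(b*x)/(b*x) is written as np.sinc(b*x/np.pi) and stays finite at x = 0
    return A * np.sinc(b * x / np.pi) ** 2

(A, b), _ = curve_fit(f, x, y, p0=[1.0, 1.0])
print(A, b)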