Deriving a new continuous variable out of logistic regression coefficients - python-3.x

I have a set of independent variables X and a set of values of the dependent variable Y. The task at hand is binomial classification, i.e. predicting whether a debtor will default on their debt (1) or not (0).
After filtering out statistically insignificant variables and variables that bring about multicollinearity, I am left with the following summary of the logistic regression model:
Accuracy: ~0.87
Confusion matrix:
[[1038  254]
 [  72 1182]]
Parameter    Coefficient
intercept    -4.210
A             5.119
B             0.873
C            -1.414
D             3.757
Now I convert these coefficients into a new continuous variable "default_probability" by turning the log-odds into a probability, i.e.
import math

e = math.e
# A, B, C, D hold the values of the predictors for a given debtor
power = (-4.210 * 1) + (A * 5.119) + (B * 0.873) + (C * -1.414) + (D * 3.757)
default_probability = (e ** power) / (1 + (e ** power))
When I divide my original dataset into quartiles according to this new continuous variable "default_probability", then:
1st quartile contains 65% of defaulted debts (577 out of 884 incidents)
2nd quartile contains 23% of defaulted debts (206 out of 884 incidents)
3rd quartile contains 9% of defaulted debts (77 out of 884 incidents)
4th quartile contains 3% of defaulted debts (24 out of 884 incidents)
At the same time, the overall quantity of debtors per quartile is:
1st quartile - 1145
2nd quartile - 516
3rd quartile - 255
4th quartile - 3043
I wanted to use "default_probability" to surgically remove the most problematic credits by imposing the business rule "no credit to the 1st quartile". But now I wonder whether this rule is "surgical" at all (by applying it I would lose 1145 - 577 = 568 "good" clients), and, more generally, is it mathematically/logically correct to derive a new continuous variable for the dataset from the coefficients of a logistic regression by the line of reasoning described above?

You have forgotten the intercept when you compute power. But supposing this is only a typo, as you said in the comments, then your approach is valid. However, you might want to use scikit-learn's predict_proba function, which will save you the trouble. Example:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
import numpy as np

# Fit a logistic regression on an example dataset
data = load_breast_cancer()
X = data.data
y = data.target
lr = LogisticRegression()
lr.fit(X, y)
Suppose I then want to compute the probability of belonging to class 1 for a given observation (say observation i). I can do what you have done, using the regression coefficients and the intercept:
i = 0
1/(1+np.exp(-X[i].dot(lr.coef_[0])-lr.intercept_[0]))
Or just do:
lr.predict_proba(X)[i][1]
which is faster.
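As a quick sanity check (a small sketch continuing the example above), the manual computation and predict_proba should agree over the whole dataset:
manual = 1 / (1 + np.exp(-(X.dot(lr.coef_[0]) + lr.intercept_[0])))
assert np.allclose(manual, lr.predict_proba(X)[:, 1])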

Related

True Positive value difference in confusion matrix

To assess accuracy for LULCC (land use/land cover change), I have used the confusion matrix from pandas_ml. However, the statistics report has me confused: the actual vs. predicted matrix indicates 20 (points) for the LSAgri class, but the TP value is 57 for LSAgri. Shouldn't these two values be identical? (see the screenshots: class statistics vs. confusion matrix)

Mean Square Error (MSE) and Root Mean Square Error (RMSE)

I'm working on a study project in which I used mean_squared_error from sklearn to get my MSE and RMSE.
I can't understand what the information means.
I'm using a dataset about house sales and I want to predict the price of a house using linear regression. Comparing my predicted prices with the real prices, I got these results:
MSE: 1114197668.6920328 RMSE: 33379.59958855158
What does this information actually mean? That my prediction will differ from the real price by about 33379.60 on average?
import numpy as np
from sklearn.metrics import mean_squared_error

MSE = mean_squared_error(predict, testSalePrice)
RMSE = np.sqrt(MSE)
Mean Squared Error:
In statistics, the mean squared error (MSE) or mean squared deviation
(MSD) of an estimator (of a procedure for estimating an unobserved
quantity) measures the average of the squares of the errors.
So for example let's assume you have three datapoints:
Price Predicted
1900 2000
2000 2000
2100 2000
Then the MSE is: 1/3 * ((-100)*(-100) + 0*0 + 100*100) = 1/3 * 20000 ≈ 6666.67
The perfect value would be 0, but you will probably never reach it. You have to interpret it relative to your actual value range.
The RMSE in this case would be: SQRT(6666.67) ≈ 81.65
This is more interpretable: it means that on average you are about 82 away from your prediction, which makes sense if you look at the three results.
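A minimal sketch of the same calculation in Python, using sklearn's mean_squared_error with the three example data points above:
import numpy as np
from sklearn.metrics import mean_squared_error

price = np.array([1900, 2000, 2100])       # actual values
predicted = np.array([2000, 2000, 2000])   # predictions

mse = mean_squared_error(price, predicted)  # ≈ 6666.67
rmse = np.sqrt(mse)                         # ≈ 81.65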

How can I use Prophet for per-location forecasting given overall sales data

I am using Prophet for sales forecasting, and I have several CSVs. Most of them contain sales data by date for a specific location (e.g. Location1.csv has "Jan 1, 2010, X widgets sold", etc.).
There's a master CSV which aggregates sales across all locations. I have used Prophet to forecast sales across all locations, and that works well, but the per-location data is very variable.
I'm seeing much higher mean absolute errors (MAE) for per-store forecasts, while the overall model has a much lower MAE.
Is there any way I can use the overall sales model to help predict per-location sales? Or are there any alternatives to forecasting per-location sales besides just using the raw sales data for that location?
Yes, you can use your overall sales model to help predict the per-location sales in Prophet using the add_regressor method.
Let's first create a sample df, where y is the variable we want to predict (per-location sales) and overalls are the overall sales:
import pandas as pd
df = pd.DataFrame(pd.date_range(start="2019-09-01", end="2019-09-30", freq='D', name='ds'))
df["y"] = range(1,31)
df["overalls"] = range(101,131)
df.head()
ds y overalls
0 2019-09-01 1 101
1 2019-09-02 2 102
2 2019-09-03 3 103
3 2019-09-04 4 104
4 2019-09-05 5 105
and split train and test:
df_train = df.loc[df["ds"]<"2019-09-21"]
df_test = df.loc[df["ds"]>="2019-09-21"]
Before training the forecaster, we can add regressors that use additional variables. Here the argument of add_regressor is the column name of the additional variable in the training df.
from fbprophet import Prophet
m = Prophet()
m.add_regressor('overalls')
m.fit(df_train)
The predict method will then use the additional variables to forecast:
forecast = m.predict(df_test.drop(columns="y"))
Note that the additional variables must have values for your future (test) data. Since you initially don't have the future overall sales, you could start by predicting overalls with a univariate time series model, and then predict y with add_regressor, using the predicted overalls as the future values of the additional variable, as sketched below.
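A minimal sketch of that two-stage idea, reusing m, df_train and df_test from the example above (the new variable names are just illustrative):
# Stage 1: univariate forecast of the overall sales
m_overall = Prophet()
m_overall.fit(df_train[["ds", "overalls"]].rename(columns={"overalls": "y"}))
overalls_hat = m_overall.predict(df_test[["ds"]])

# Stage 2: feed the predicted overalls in as future values of the extra regressor
df_future = df_test[["ds"]].copy()
df_future["overalls"] = overalls_hat["yhat"].values
forecast = m.predict(df_future)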
See also this notebook, with an example of using weather factors as extra regressors in a forecast of bicycle usage.

Understanding the result of logistic regression

Let us suppose we have the following data with a binary response output (coupon).
Annual spending is given in units of 1000. My goal is to estimate whether a customer who spends more than 2000 and has a Simmons card will also use a coupon. First of all, I sorted the data according to the response variable (the sorted table was given as a picture).
At the next stage I calculated the logit for each data point. Initially I chose the following coefficients:
B0 0.1
B1 0.1
B2 0.1
and I calculated L according to the formula L = B0 + B1 * (Annual spending) + B2 * (Simmons card).
At the next stage I calculated e^L (which in Excel can be done easily with the EXP function):
=EXP(D2)
After that I calculated the probability:
=E2/(1+E2)
and finally I calculated the log likelihood for each row, using the formula LL = y * ln(p) + (1 - y) * ln(1 - p).
Then I calculated the sum and, using Solver, searched for the coefficients that optimize this sum (please note that the values are negative), but I got all coefficients equal to zero.
Am I wrong? Or does it mean that I can't predict the buying of a coupon based on Annual spending and owning a Simmons card? Thanks in advance.
You can predict the buying of a coupon on the basis of Annual spending (and knowing Simmons card doesn't help).
Admittedly I didn't solve it in Excel, but I suspect the problem is that your optimization didn't converge (i.e., it failed to reach the correct coefficients through the solving process); the correct coefficients are B0 = 5.63, B1 = -2.95, and B2 = 0. I found an online reference for the Excel logistic regression procedure at http://blog.excelmasterseries.com/2014/06/logistic-regression-performed-in-excel.html.
I ran the logistic regression myself and found that Annual spending is significant (at the 0.05 level) whereas Simmons card is not. Re-running the model with Simmons card removed yields the following equations:
L = 5.63 - 2.95 * Annual spending
P(1) = exp(L)/(1 + exp(L))
If P(1) > 0.5 => coupon = 1
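As a quick worked example of these equations: a customer with Annual spending = 2 (i.e. 2000 in the original units) gives L = 5.63 - 2.95 * 2 = -0.27, so P(1) = exp(-0.27) / (1 + exp(-0.27)) ≈ 0.43, which is below 0.5, and the model predicts coupon = 0.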
Although the entropy Rsquare is low at 0.39 (and the number of data points is very low), the model is statistically significant.

Statistical correlation: Pearson or Spearman?

I have 2 series of 45 values in the interval [0,1]. The first series is a human-generated standard, the second one is computer-generated (full series here http://www.copypastecode.com/74844/).
The first series is sorted in decreasing order.
0.909090909 0.216196598
0.909090909 0.111282099
0.9 0.021432587
0.9 0.033901106
...
0.1 0.003099256
0 0.001084533
0 0.008882249
0 0.006501463
Now what I want to assess is the degree to which the order is preserved in the second series, given that the first series is monotonic.
The Pearson correlation is 0.454763067, but I think that the relationship is not linear so this value is difficult to interpret.
A natural approach would be to use the Spearman rank correlation, which in this case is 0.670556181.
I noticed that with random values, while Pearson is very close to 0, the Spearman rank correlation goes up to 0.5, so a value of 0.67 seems very low.
What would you use to assess the order similarity between these 2 series?
I want to assess the degree to which the order is preserved
Since it's the order (rank) that you care about, Spearman rank correlation is the more meaningful metric here.
I noticed that with random values [...] the Spearman rank correlation goes up to 0.5
How do you generate those random values? I've just conducted a simple experiment with some random numbers generated using numpy, and I am not seeing that:
In [1]: import numpy as np
In [2]: import scipy.stats
In [3]: x = np.random.randn(1000)
In [4]: y = np.random.randn(1000)
In [5]: print(scipy.stats.spearmanr(x, y))
(-0.013847401847401847, 0.66184551507218536)
The first number (-0.01) is the rank correlation coefficient; the second number (0.66) is the associated p-value.
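As a complementary illustration (a small sketch with synthetic data, continuing the session above), a monotonic but nonlinear relationship gives a Spearman correlation of exactly 1 while Pearson stays noticeably below 1, which is why rank correlation suits an order-preservation question:
In [6]: x = np.linspace(0, 5, 45)
In [7]: y = np.exp(x)  # strictly increasing, but nonlinear
In [8]: print(scipy.stats.pearsonr(x, y)[0])   # noticeably below 1
In [9]: print(scipy.stats.spearmanr(x, y)[0])  # exactly 1.0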
