How to set the maxfun limit of the lbfgs solver on scikit-learn LogisticRegression model? - scikit-learn

My scikit-learn LogisticRegression model, which uses the lbfgs solver, is stopping early as shown in the logs below. The data is standardized.
(...)
At iterate13150 f= 4.05397D+03 |proj g|= 2.41194D+04
At iterate13200 f= 4.05213D+03 |proj g|= 1.36863D+04
.venv/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
[Parallel(n_jobs=-1)]: Done 1 out of 1 | elapsed: 5.5s finished
* * *
Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value
* * *
N Tit Tnf Tnint Skip Nact Projg F
62 13240 15001 1 0 0 4.800D+04 4.051D+03
F = 4051.0211050375365
sklearn uses the scipy implementation of the lbfgs solver. The function scipy/optimize/_lbfgsb_py.py:_minimize_lbfgsb has the following early stop conditions
if n_iterations >= maxiter:
    task[:] = 'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'
elif sf.nfev > maxfun:
    task[:] = ('STOP: TOTAL NO. of f AND g EVALUATIONS '
               'EXCEEDS LIMIT')
I am indeed hitting the sf.nfev > maxfun limit. Unfortunately, sklearn fixes the value of maxfun to 15_000 when it instantiates the scipy solver (sklearn/linear_model/_logistic.py:442).
When I hotfix the sklearn package to set maxfun to 100_000, the solver converges. But this is not a real solution, since I do not want to carry around a custom sklearn distribution that differs by a single constant.
Any ideas on how to set the maxfun parameter in another way?
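For reference, the stop condition quoted above can be reproduced by calling SciPy directly, where maxfun is an ordinary solver option. This is only a sketch: the Rosenbrock function and the starting point are stand-ins (assumptions, not the logistic loss from the question), and it does not by itself change what LogisticRegression passes to the solver.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Stand-in objective and starting point (assumptions for illustration only).
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

# Tiny evaluation budget: the solver hits the maxfun limit before converging.
capped = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
                  options={"maxfun": 3, "maxiter": 15000})

# Generous evaluation budget: the same problem now runs to convergence.
relaxed = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
                   options={"maxfun": 100000, "maxiter": 15000})

print(capped.success, capped.nfev, capped.message)
print(relaxed.success, relaxed.nfev)
```

The first run reports success=False with a message about the evaluation limit, which is the same situation as in the log above: the iteration count is still below maxiter, but the number of function evaluations has passed maxfun.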

Related

How to customize threshold PyTorch

I have trained ResNet50 for binary image classification.
I want to decrease false negatives by reducing the threshold value.
How can I do that?
To decrease the number of false negatives (FN) i.e. increase the recall (since recall = TP / (TP + FN)) you should increase the positive weight (the weight of the occurrence of that class) above 1. For example nn.BCEWithLogitsLoss allows you to provide the pos_weight option:
pos_weight > 1 increases the recall, pos_weight < 1 increases the precision.
For example, if a dataset contains 100 positive and 300 negative examples of a single class, then pos_weight for the class should be equal to 300/100 = 3. The loss would act as if the dataset contains 3*100 = 300 positive examples.
As a side note, the explicit expression for the binary cross entropy with logits (where "with logits" should rather be understood as "from logits") is:
>>> z = torch.sigmoid(q)
>>> loss = -(w_p*p*torch.log(z) + (1-p)*torch.log(1-z))
Above, q holds the raw logit values, p the binary targets, and w_p is the weight of the positive class.
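To see the effect of the positive weight numerically, here is a small sketch that mirrors the expression above in plain NumPy instead of calling PyTorch. The function name weighted_bce_from_logits and the sample values are made up for illustration.

```python
import numpy as np

def weighted_bce_from_logits(q, p, w_p=1.0):
    """Mean binary cross entropy from raw logits q and 0/1 targets p.

    The positive-class term is scaled by w_p, as in the expression above.
    """
    z = 1.0 / (1.0 + np.exp(-q))  # sigmoid of the logits
    loss = -(w_p * p * np.log(z) + (1 - p) * np.log(1 - z))
    return loss.mean()

# Illustrative values: the first sample is a positive with a negative logit,
# i.e. a false negative at the usual 0.5 threshold.
q = np.array([-1.0, 2.0, 0.5])
p = np.array([1.0, 1.0, 0.0])

base = weighted_bce_from_logits(q, p)              # w_p = 1
heavier = weighted_bce_from_logits(q, p, w_p=3.0)  # missed positives cost 3x

print(base, heavier)
```

With w_p = 3 the missed positive contributes three times as much to the loss, so training is pushed harder toward predicting the positive class, which is what raises recall.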

Finding probability for Discrete Binomial distribution problems

Problem Description:
In each of 4 different competitions, Jin has 60% chance of winning. Assuming that the competitions are independent of each other, what is the probability that: Jin will win at least 1 race?
Given Binomial distribution Parameters:
n=4
p=0.60
Hint:
P(x>=1)=1-P(x=0)
Use the binom.pmf() function of scipy.stats package to calculate the probability.
Below is the Python code I have tried, but it is being evaluated as wrong.
from scipy import stats
n = 4
p = 0.6
p1 = 1 - p
p2 = stats.binom.pmf(1,4,p1)
print(p1)
Using the hint, all you need to do is to evaluate the PMF of the binomial distribution at x=0 and subtract the result from 1 to obtain the probability of Jin winning at least one competition:
from scipy import stats
x=0
n=4
p=0.6
p0 = stats.binom.pmf(x,n,p)
print(1-p0)
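Written out by hand with n = 4 and p = 0.6, the same calculation is:

```latex
\begin{aligned}
P(X \ge 1) &= 1 - P(X = 0) \\
           &= 1 - \binom{4}{0}\,(0.6)^{0}\,(0.4)^{4} \\
           &= 1 - 0.0256 \\
           &= 0.9744
\end{aligned}
```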

Bayesian Linear Regression with PyMC3 and a large dataset - bracket nesting level exceeded maximum and slow performance

I would like to use a Bayesian multivariate linear regression to estimate the strength of players in team sports (e.g. ice hockey, basketball or soccer). For that purpose, I create a matrix, X, containing the players as columns and the matches as rows. For each match the player entry is either 1 (player plays in the home team), -1 (player plays in the away team) or 0 (player does not take part in this game). The dependent variable Y is defined as the scoring differences for both teams in each match (Score_home_team - Score_away_team).
Thus, the number of parameters will be quite large for one season (e.g. X is defined by 300 rows x 450 columns; i.e. 450 player coefficients + y-intercept). When running the fit I came across a compilation error:
('Compilation failed (return status=1): /Users/me/.theano/compiledir_Darwin-17.7.0-x86_64-i386-64bit-i386-3.6.5-64/tmpdxxc2379/mod.cpp:27598:32: fatal error: bracket nesting level exceeded maximum of 256.
I tried to handle this error by setting:
theano.config.gcc.cxxflags = "-fbracket-depth=1024"
Now, the sampling is running. However, it is so slow that even if I take only 35 of 300 rows the sampling is not completed within 20 minutes.
This is my basic code:
import pymc3 as pm
basic_model = pm.Model()
with basic_model:
    # Priors for beta coefficients - these are the coefficients of the players
    dict_betas = {}
    for col in X.columns:
        dict_betas[col] = pm.Normal(col, mu=0, sd=10)
    # Priors for unknown model parameters
    alpha = pm.Normal('alpha', mu=0, sd=10)  # alpha is the y-intercept
    sigma = pm.HalfNormal('sigma', sd=1)  # standard deviation of the observations
    # Expected value of outcome
    mu = alpha
    for col in X.columns:
        mu = mu + dict_betas[col] * X[col]  # mu = alpha + beta_1 * Player_1 + beta_2 * Player_2 + ...
    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
The instantiation of the model runs within one minute for the large dataset. I do the sampling using:
with basic_model:
    # draw 500 posterior samples
    trace = pm.sample(500)
The sampling is completed for small sample sizes (e.g. 9 rows, 80 columns) within 7 minutes. However, the time is increasing substantially with increasing sample size.
Any suggestions on how I can get this Bayesian linear regression to run in a feasible amount of time? Is this kind of problem doable using PyMC3 (remember I came across a bracket nesting error)? I saw in a recent publication that this kind of analysis is doable in R (https://arxiv.org/pdf/1810.08032.pdf). Therefore, I guess it should also somehow work with Python 3.
Any help is appreciated!
Eliminating the for loops should improve performance and might also take care of the nesting issue you are reporting. Theano TensorVariables and the PyMC3 random variables that derive from them are already multidimensional and support linear algebra operations. Try changing your code to something along the lines of
beta = pm.Normal('beta', mu=0, sd=10, shape=X.shape[1])
...
mu = alpha + pm.math.dot(X, beta)
...
If you need to specify different prior values for mu and/or sd, those arguments accept anything that theano.tensor.as_tensor_variable() accepts, so you can pass a list or numpy array.
I highly recommend getting familiar with the theano.tensor and pymc3.math operations since sometimes you must use these to properly manipulate random variables, and in general it should lead to more efficient code.
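As a quick sanity check that the vectorized form computes the same mean as the per-player loop, here is a plain NumPy sketch. It does not use PyMC3 at all, and the design matrix, coefficients and intercept are random stand-ins chosen only to match the 300 x 450 shape from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n_matches, n_players = 300, 450

# Stand-in design matrix with entries -1, 0 or 1, as described in the question.
X = rng.choice([-1, 0, 1], size=(n_matches, n_players))
beta = rng.normal(0.0, 10.0, size=n_players)  # stand-in player coefficients
alpha = 0.5                                   # stand-in intercept

# Loop form: one addition per player column (450 separate graph nodes in Theano).
mu_loop = np.full(n_matches, alpha)
for j in range(n_players):
    mu_loop = mu_loop + beta[j] * X[:, j]

# Vectorized form: a single matrix-vector product.
mu_dot = alpha + X @ beta

print(np.allclose(mu_loop, mu_dot))
```

The two results agree, so switching to pm.math.dot changes only how the computation is expressed, not the model itself.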

Negative values for a non-negative parameter in tensorflow probability

I'm trying to fit a simple Dirichlet-Multinomial model in tensorflow probability. The concentration parameters are gamma and I have put a Gamma(1,1) prior distribution on them. This is the model, where S is the number of categories and N is the number of samples:
def dirichlet_model(S, N):
    gamma = ed.Gamma(tf.ones(S)*1.0, tf.ones(S)*1.0, name='gamma')
    y = ed.DirichletMultinomial(total_count=500., concentration=gamma, sample_shape=(N), name='y')
    return y
log_joint = ed.make_log_joint_fn(dirichlet_model)
However, when I try to sample from this using HMC, the acceptance rate is zero, and the initial draw for gamma contains negative values. Am I doing something wrong? Shouldn't negative proposals for the concentration parameters be rejected automatically? Below my sampling code:
def target_log_prob_fn(gamma):
    """Unnormalized target density as a function of states."""
    return log_joint(
        S=S, N=N,
        gamma=gamma,
        y=y_new)
num_results = 5000
num_burnin_steps = 3000
states, kernel_results = tfp.mcmc.sample_chain(
    num_results=num_results,
    num_burnin_steps=num_burnin_steps,
    current_state=[
        tf.ones([5], name='init_gamma')*5,
    ],
    kernel=tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=target_log_prob_fn,
        step_size=0.4,
        num_leapfrog_steps=3))
gamma = states
with tf.Session() as sess:
    [
        gamma_,
        is_accepted_,
    ] = sess.run([
        gamma,
        kernel_results.is_accepted,
    ])
num_accepted = np.sum(is_accepted_)
print('Acceptance rate: {}'.format(num_accepted / num_results))
Try reducing the step size to increase the acceptance rate. The optimal acceptance rate for HMC is around 0.651 (https://arxiv.org/abs/1001.4460). I'm not sure why you'd see negative values. Maybe floating-point error near zero? Can you post some of the logs from your run?

How to avoid impression bias when calculating the CTR?

When we train a CTR (click-through rate) model, we sometimes need to calculate the real CTR from historical data, like this:
ctr = #(clicks) / #(impressions)
We know that if the number of impressions is too small, the calculated CTR is not reliable. So we always set a threshold and keep only the items with enough impressions.
But we also know that the more impressions there are, the higher the confidence in the CTR. So my question is: is there an impression-normalized statistical method for calculating the CTR?
Thanks!
You probably need a confidence interval for your estimated CTR. The Wilson score interval is a good one to try.
You need the following statistics to calculate the interval:
\hat p is the observed ctr (fraction of #clicked vs #impressions)
n is the total number of impressions
zα/2 is the (1-α/2) quantile of the standard normal distribution
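In terms of these quantities, the standard Wilson score interval can be written as follows, with z standing for the normal quantile defined above. This is the textbook form, and it is consistent with the confidence function in this answer.

```latex
\frac{\hat p + \dfrac{z^{2}}{2n} \;\pm\; z\,\sqrt{\dfrac{\hat p\,(1-\hat p)}{n} + \dfrac{z^{2}}{4n^{2}}}}
     {1 + \dfrac{z^{2}}{n}},
\qquad z = z_{\alpha/2}
```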
A simple implementation in Python is shown below. It uses z = 1.96, which corresponds to a 95% confidence interval. Three test results, rounded to two decimals:
# clicks    # impressions    # conf interval
2           10               (0.06, 0.51)
20          100              (0.13, 0.29)
200         1000             (0.18, 0.23)
Now you can apply a threshold to the calculated confidence interval.
from math import sqrt

def confidence(clicks, impressions):
    n = impressions
    if n == 0:
        return 0
    z = 1.96  # 1.96 -> 95% confidence
    phat = float(clicks) / n
    denorm = 1. + (z*z/n)
    enum1 = phat + z*z/(2*n)
    enum2 = z * sqrt(phat*(1-phat)/n + z*z/(4*n*n))
    # (lower bound, upper bound) of the Wilson score interval
    return (enum1-enum2)/denorm, (enum1+enum2)/denorm

def wilson(clicks, impressions):
    if impressions == 0:
        return 0
    else:
        return confidence(clicks, impressions)

if __name__ == '__main__':
    print(wilson(2, 10))
    print(wilson(20, 100))
    print(wilson(200, 1000))
    """
    --------------------
    results (rounded to 4 decimals):
    (0.0567, 0.5098)
    (0.1334, 0.2888)
    (0.1764, 0.2259)
    """
If you treat this as a binomial parameter, you can do Bayesian estimation. If your prior on ctr is uniform (a Beta distribution with parameters (1,1)) then your posterior is Beta(1+#click, 1+#impressions-#click). Your posterior mean is (#click+1) / (#impressions+2) if you want a single summary statistic of this posterior, but you probably don't, and here's why:
I don't know what your method is for determining whether ctr is high enough, but let's say you're interested in everything with ctr > 0.9. You can then use the cumulative distribution function of the beta distribution to look at what proportion of probability mass is over the 0.9 threshold (this will just be 1 - the cdf at 0.9). In this way, your threshold will naturally incorporate uncertainty about the estimate because of limited sample size.
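A minimal sketch of that idea with scipy.stats.beta, assuming the uniform Beta(1,1) prior described above. The helper name prob_ctr_above and the 0.15 threshold in the loop are illustrative choices, not part of the original answer.

```python
from scipy.stats import beta

def prob_ctr_above(clicks, impressions, threshold):
    """Posterior probability that the true CTR exceeds `threshold`.

    Assumes a uniform Beta(1, 1) prior, so the posterior is
    Beta(1 + clicks, 1 + impressions - clicks).
    """
    a = 1 + clicks
    b = 1 + impressions - clicks
    # sf is the survival function, i.e. 1 - cdf.
    return beta.sf(threshold, a, b)

# Same observed CTR (20%) with growing sample sizes.
for clicks, impressions in [(2, 10), (20, 100), (200, 1000)]:
    post_mean = (clicks + 1) / (impressions + 2)
    print(clicks, impressions, post_mean,
          prob_ctr_above(clicks, impressions, 0.15))
```

The observed CTR is 20% in all three cases, but the probability that the true CTR is above 0.15 grows with the number of impressions, which is how the sample size enters the decision.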
There are many ways to calculate this confidence interval. An alternative to the Wilson score is the Clopper-Pearson interval, which I found useful in spreadsheets.
Upper bound: B(1 - alpha/2; x + 1, n - x)
Lower bound: B(alpha/2; x, n - x + 1)
Where
B(q; a, b) is the inverse Beta distribution (the quantile at probability q for shape parameters a and b)
alpha is the confidence level error (e.g. for a 95% confidence level, alpha is 5%)
n is the number of samples (e.g. impressions)
x is the number of successes (e.g. clicks)
In Excel an implementation for B() is provided by the BETA.INV formula.
There is no equivalent formula for B() in Google Sheets, but a Google Apps Script custom function can be adapted from the JavaScript Statistical Library (e.g. search GitHub for jstat).
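Outside a spreadsheet, the same interval can be sketched in Python, with scipy.stats.beta.ppf playing the role of the inverse Beta function. This uses the standard Clopper-Pearson bounds, and the function name clopper_pearson is an illustrative choice.

```python
from scipy.stats import beta

def clopper_pearson(x, n, alpha=0.05):
    """Clopper-Pearson interval for x successes (clicks) in n trials (impressions)."""
    # The inverse Beta is undefined at the edges, so use the conventional limits.
    lower = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    upper = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lower, upper

for clicks, impressions in [(2, 10), (20, 100), (200, 1000)]:
    print(clicks, impressions, clopper_pearson(clicks, impressions))
```

As with the Wilson interval, the bounds tighten around the observed CTR as the number of impressions grows.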

Resources