With the scipy.stats package it is straightforward to fit a distribution to data, e.g. scipy.stats.expon.fit() can be used to fit data to an exponential distribution.
However, I am trying to fit data to a censored/conditional distribution in the exponential family. In other words, using MLE, I am trying to find the maximum of

L(θ) = Σ_i [ log f(x_i; θ) - log F(x_max; θ) ],

where f(·; θ) is a PDF of a distribution in the exponential family, F(·; θ) is its corresponding CDF, and x_max is the censoring point (max_delays in the code below).
Mathematically, I have found that this log-likelihood is convex in the parameter space θ, so my assumption was that it should be relatively straightforward to apply the scipy.optimize.minimize function. Notice in the above log-likelihood that by taking x_max → ∞ (so that F(x_max; θ) → 1) we obtain the traditional/uncensored MLE problem.
However, I find that even for simple distributions the Nelder-Mead simplex algorithm, for example, does not always converge, or it does converge but the estimated parameters are far off from the true ones. I have attached my code below. Notice that one can choose a distribution, and that the code is generic enough to fit the loc and scale parameters, as well as the optional shape parameters (e.g. for a Beta or Gamma distribution).
My question is: what am I doing wrong that leads to these bad estimates, or sometimes to convergence issues? I have tried a few algorithms, but none of them works easily, which surprises me since the problem is convex. Are there smoothness issues, and do I need a way to supply the Jacobian and Hessian generically for this problem?
Are there other methods to tackle this problem? Initially I thought of overriding the fit() function in the scipy.stats.rv_continuous class to take care of the censoring with the CDF, but that seemed quite cumbersome. Since the problem is convex, I would guess that using scipy's minimize function I should be able to get the results easily...
Comments and help are very welcome!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import expon, gamma, beta, norm
from scipy.optimize import minimize
from scipy.stats import rv_continuous as rv
def log_likelihood(func: rv, delays, max_delays=10**8, **func_pars) -> float:
    return np.sum(np.log(func.pdf(delays, **func_pars)) - np.log(func.cdf(max_delays, **func_pars)))
def minimize_log_likelihood(func: rv, delays, max_delays):
    # Determine number of parameters to estimate (always 'loc', 'scale', sometimes shape parameters)
    n_pars = 2 + func.numargs
    # Initialize guess (loc, scale, [shapes])
    x0 = np.ones(n_pars)

    def wrapper(params, *args):
        func = args[0]
        delays = args[1]
        max_delays = args[2]
        loc, scale = params[0], params[1]
        # Set 'loc' and 'scale' parameters
        kwargs = {'loc': loc, 'scale': scale}
        # Add shape parameters if existing to kwargs
        if func.shapes is not None:
            for i, s in enumerate(func.shapes.split(', ')):
                kwargs[s] = params[2+i]
        return -log_likelihood(func=func, delays=delays, max_delays=max_delays, **kwargs)

    # Find maximum of log-likelihood (thus minimum of minus log-likelihood; see wrapper function)
    return minimize(wrapper, x0, args=(func, delays, max_delays), options={'disp': True},
                    method='nelder-mead', tol=1e-8)
# Test code by sampling from a known distribution, and retrieve the parameters
distribution = expon
dist_pars = {'loc': 0, 'scale': 4}
x = np.linspace(distribution.ppf(0.0001, **dist_pars), distribution.ppf(0.9999, **dist_pars), 1000)
res = minimize_log_likelihood(distribution, x, 10**8)
print(res)
I have found that the poor convergence is due to numerical inaccuracies. The best fix is to replace
np.log(func.pdf(x, **func_kwargs))
with
func.logpdf(x, **func_kwargs)
This leads to correct estimation of the parameters. The same holds for the CDF. The SciPy documentation also indicates that the latter has better numerical accuracy.
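For reference, a minimal sketch of the log_likelihood function above with that change applied (same signature, using logpdf and logcdf):

def log_likelihood(func: rv, delays, max_delays=10**8, **func_pars) -> float:
    # logpdf/logcdf avoid the loss of precision in log(pdf)/log(cdf)
    return np.sum(func.logpdf(delays, **func_pars)
                  - func.logcdf(max_delays, **func_pars))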
This all works nicely with the Exponential, Normal, Gamma and chi2 distributions. The Beta distribution still gives me issues, but I think this is again due to some (other) numerical inaccuracies, which I will analyse separately.
I have read the documentation of decision_function and score_samples here, but I could not figure out what the difference between these two methods is, or which one I should use for an outlier detection algorithm.
Any help would be appreciated.
See the documentation for the attribute offset_:
Offset used to define the decision function from the raw scores. We have the relation: decision_function = score_samples - offset_. offset_ is defined as follows. When the contamination parameter is set to “auto”, the offset is equal to -0.5 as the scores of inliers are close to 0 and the scores of outliers are close to -1. When a contamination parameter different than “auto” is provided, the offset is defined in such a way we obtain the expected number of outliers (samples with decision function < 0) in training.
The User Guide references the paper Isolation Forest by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou.
I did not read the paper, but I think you can use either output to detect outliers. The documentation says score_samples is the opposite of decision_function, so I thought they would be inversely related, but both outputs seem to have the exact same relationship with the target. The only difference is that they are on different ranges. In fact, they even have the same variance.
To see this, I fit the model to the breast cancer dataset available in sklearn and visualized the average of the target variable grouped by the deciles of each output. As you can see, they both have the exact same relationship.
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest
# Load data
X = load_breast_cancer()['data']
y = load_breast_cancer()['target']
# Fit model
clf = IsolationForest()
clf.fit(X, y)
# Split the outputs into deciles to see their relationship with target
t = pd.DataFrame({'target': y,
                  'decision_function': clf.decision_function(X),
                  'score_samples': clf.score_samples(X)})
t['bins_decision_function'] = pd.qcut(t['decision_function'], 10)
t['bins_score_samples'] = pd.qcut(t['score_samples'], 10)
# Visualize relationship
plt.plot(t.groupby('bins_decision_function')['target'].mean().values, lw=3, label='Decision Function')
plt.plot(t.groupby('bins_score_samples')['target'].mean().values, ls='--', label='Score Samples')
plt.legend()
plt.show()
Like I said, they even have the same variance:
t[['decision_function','score_samples']].var()
> decision_function 0.003039
> score_samples 0.003039
> dtype: float64
In conclusion, you can use them interchangeably as they both share the same relationship with the target.
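That identical variance is expected: the two outputs differ only by a constant shift (the estimator's offset_ attribute), and shifting by a constant changes neither the variance nor the relationship with the target. A quick sanity check, reusing clf and X from above:

import numpy as np

# decision_function is just score_samples shifted by the constant offset_
assert np.allclose(clf.decision_function(X),
                   clf.score_samples(X) - clf.offset_)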
As was previously stated in @Ben Reiniger's answer, decision_function = score_samples - offset_. For further clarification...
If contamination = 'auto', then offset_ is fixed to -0.5.
If contamination is set to something other than 'auto', then offset_ is no longer fixed.
This can be seen under the fit function in the source code:
def fit(self, X, y=None, sample_weight=None):
    ...
    if self.contamination == "auto":
        # 0.5 plays a special role as described in the original paper.
        # we take the opposite as we consider the opposite of their score.
        self.offset_ = -0.5
        return self

    # else, define offset_ wrt contamination parameter
    self.offset_ = np.percentile(self.score_samples(X),
                                 100. * self.contamination)
Thus, it's important to take note of what contamination is set to, as well as which anomaly scores you are using. score_samples returns what can be thought of as the "raw" scores, as it is unaffected by offset_, whereas decision_function is shifted by offset_.
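To make the contamination dependence concrete, here is a minimal sketch (clf_auto and clf_10pct are illustrative names; the breast cancer data is only used as a convenient example, as in the other answer):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

X, _ = load_breast_cancer(return_X_y=True)

# contamination='auto': offset_ is the fixed -0.5 described above
clf_auto = IsolationForest(contamination='auto', random_state=0).fit(X)
print(clf_auto.offset_)  # -0.5

# explicit contamination: offset_ becomes a percentile of the raw scores, so that
# roughly that fraction of the training samples gets decision_function < 0
clf_10pct = IsolationForest(contamination=0.1, random_state=0).fit(X)
print(clf_10pct.offset_)                            # data-dependent value
print((clf_10pct.decision_function(X) < 0).mean())  # approximately 0.1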
I wish to speed up the sparse system solver part of my code using Numba. Here is what I have up till now:
# Both numba and numba-scipy packages are installed. I am using PyCharm IDE
import numba
import numba_scipy
# import other required stuff
@numba.jit(nopython=True)
def solve_using_numba(A, b):
    return sp.linalg.gmres(A, b)
# total = the number of points in the system
A = sp.lil_matrix((total, total), dtype=float)
# populate A with appropriate data
A = A.tocsc()
b = np.zeros((total, 1), dtype=float)
# populate b with appropriate data
y, exit_code = solve_using_numba(A, b)
# plot solution
This raises the error
argument 0: cannot determine Numba type of <class 'scipy.sparse.csc.csc_matrix'>
According to the official documentation, numba-scipy extends Numba to make it aware of SciPy. But it seems that, here, Numba cannot work with SciPy's sparse matrix classes. Where am I going wrong, and what can I do to fix this?
I only need to speed up the sparse-system solution part of the code, because the rest is pretty lightweight: taking a couple of user inputs, constructing the A and b matrices, and plotting the end result.
The gradients from tf.GradientTape seem not to match the correct minimum in the function I'm trying to minimise.
I'm trying to use TensorFlow Probability's black-box variational inference (using TF2) with tf.GradientTape and a Keras optimizer, calling the apply_gradients function. The surrogate posterior is a simple 1d Normal. I'm trying to approximate a mixture of two Normals, see the pdist function. For simplicity I just try to optimise the scale parameter.
Current code:
from scipy.special import erf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability import distributions as tfd
def pdist(x):
    return (.5/np.sqrt(2*np.pi)) * np.exp((-(x+3)**2)/2) + (.5/np.sqrt(2*np.pi)) * np.exp((-(x-3)**2)/2)

def logpdist(x):
    logp = np.log(1e-30 + pdist(x))
    assert np.all(np.isfinite(logp))
    return logp
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)
mu = tf.Variable(0.0,dtype=tf.float64)
scale = tf.Variable(1.0,dtype=tf.float64)
for it in range(100):
    with tf.GradientTape() as tape:
        surrogate_posterior = tfd.Normal(mu, scale)
        elbo_loss = tfp.vi.monte_carlo_variational_loss(logpdist, surrogate_posterior, sample_size=10000)
    gradients = tape.gradient(elbo_loss, [scale])
    optimizer.apply_gradients(zip(gradients, [scale]))
    if it % 10 == 0:
        print(scale.numpy(), gradients[0].numpy(), elbo_loss.numpy())
Output (showing every 10th iteration):
SCALE GRAD ELBO_LOSS
1.100, -1.000, 2.697
2.059, -0.508, 1.183
2.903, -0.354, 0.859 <<< (right answer about here)
3.636, -0.280, 1.208
4.283, -0.237, 1.989
4.869, -0.208, 3.021
5.411, -0.187, 4.310
5.923, -0.170, 5.525
6.413, -0.157, 7.250
6.885, -0.146, 8.775
For some reason the gradient doesn't reflect the true gradient, which should be about zero around scale=2.74.
Why does the gradient not relate to the actual elbo_loss?
Hopefully someone can elaborate on why the previous implementation failed (and also why it doesn't raise an exception, but instead just gives the wrong answer). Anyway, I found I could fix it by ensuring that the key expressions used the TensorFlow maths library and not numpy's. Specifically, replacing the two methods above with:
def pdist(x):
    return (.5/np.sqrt(2*np.pi)) * tf.exp((-(x+3)**2)/2) + (.5/np.sqrt(2*np.pi)) * tf.exp((-(x-3)**2)/2)

def logpdist(x):
    return tf.math.log(pdist(x))
The stochastic optimisation now works.
Output:
2.020, -0.874, 1.177
2.399, -0.393, 0.916
2.662, -0.089, 0.857
2.761, 0.019, 0.850
2.765, 0.022, 0.843
2.745, -0.006, 0.851
2.741, 0.017, 0.845
2.752, 0.005, 0.852
2.744, 0.015, 0.852
2.747, 0.013, 0.862
I'm not going to accept my own answer, as I'd be grateful for answers that give some intuition about why this now works and why it failed previously (and why the failure mode wasn't an exception or similar, but instead an incorrect gradient).
I'm starting with PySpark, building binary classification models (logistic regression), and I need to find the optimal threshold (cutoff) point for my models.
I want to use the ROC curve to find this point, but I don't know how to extract the threshold value for each point on the curve. Is there a way to find these values?
Things I've found:
This post shows how to extract the ROC curve, but only gives the values for the TPR and FPR. That's useful for plotting and for selecting the optimal point, but I can't find the threshold values there.
I know I can find the threshold values for each point in the ROC curve using H2O (I've done it before), but I'm working in PySpark.
Here is a post describing how to do it with R... but, again, I need to do it with PySpark.
Other facts
I'm using Apache Spark 2.4.0.
I'm working with Data Frames (I really don't know - yet - how to work with RDDs, but I'm not afraid to learn ;) )
If you specifically need to generate ROC curves for different thresholds, one approach could be to generate a list of threshold values you're interested in and fit/transform on your dataset for each threshold. Or you could manually calculate the ROC curve for each threshold point using the probability field in the response from model.transform(test).
Alternatively, you can use BinaryClassificationMetrics to extract a curve plotting various metrics (F1 score, precision, recall) by threshold.
Unfortunately it appears the PySpark version doesn't implement most of the methods the Scala version does, so you'd need to wrap the class to do it in Python.
For example:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)

    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets
        # considering there may be one probability per datapoint (at most)
        # The Scala version takes a numBins parameter,
        # but it doesn't seem possible to pass this from Python to Java
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2,
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points

    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt
preds = predictions.select('label','probability').rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))
# Returns as a list (false positive rate, true positive rate)
points = CurveMetrics(preds).get_curve('roc')
plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(x_val, y_val)
This produces a plot of the ROC curve.
If you aren't married to ROC, an F1 score curve by threshold value can be built the same way, as sketched below.
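A sketch reusing preds and CurveMetrics from above; fMeasureByThreshold is the underlying Scala method, which returns (threshold, F1) pairs:

f1_points = CurveMetrics(preds).get_curve('fMeasureByThreshold')

thresholds = [t for (t, f1) in f1_points]
f1_scores = [f1 for (t, f1) in f1_points]

plt.figure()
plt.title('F1 score by threshold')
plt.xlabel('Threshold')
plt.ylabel('F1 score')
plt.plot(thresholds, f1_scores)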
One way is to use sklearn.metrics.roc_curve.
First use your fitted model to make predictions:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(trainingData)
predictions = model.transform(testData)
Then collect your scores and labels [1]:
preds = predictions.select('label','probability')\
    .rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))\
    .collect()
Now transform preds to work with roc_curve:
from sklearn.metrics import roc_curve
y_score, y_true = zip(*preds)
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label = 1)
Notes:
[1] I am not 100% certain that the probabilities vector will always be ordered such that the positive label will be at index 1. However, in a binary classification problem you'll know right away if your AUC is less than 0.5. In that case, just take 1-p for the probabilities (since the class probabilities sum to 1).
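Once you have the thresholds array, one common heuristic for the 'optimal' cutoff asked about in the question is Youden's J statistic, i.e. the threshold maximising TPR - FPR. A minimal sketch (other criteria may suit your problem better):

import numpy as np

# Youden's J: pick the threshold where TPR - FPR is largest
best_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[best_idx]
print(optimal_threshold)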
I am trying to fit a function of the form log(y) = a*log(b-x) + c, where a, b and c are the parameters that need to be fitted. The relevant bit of code is:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
def logfunc(T, a, b, c):
    v = (a*np.log(b-T)) + c
    return v

popt, pcov = curve_fit(logfunc, T, np.log(Energy), check_finite=False, bounds=([0.1, 1.8, 0.1], [1.0, 2.6, 1.0]))
plt.plot(T, logfunc(T, *popt))
plt.show()
Where T and Energy are some data that was generated (I use them to plot other things, so the data should be fine). T is between 0.3 and 3.2. I am pretty sure that the problem is that there is a point where b=T, because I keep getting the error ValueError: Residuals are not finite in the initial point, but I am not sure how to solve this.
You may find the lmfit package (http://lmfit.github.io/lmfit-py/) useful for this sort of problem. It provides a higher-level approach to curve-fitting problems and a better abstraction of Parameters and Models than the scipy.optimize package or the curve_fit() function.
For the problem here, two important features of lmfit are
the ability to set bounds on variables. curve_fit() can do this as well, but only by working with ordered lists of min/max bounds. With lmfit, the bounds belong to Parameter objects.
having a way to explicitly set a policy for handling NaN values, which could definitely cause problems for your fit.
With lmfit, your script would be written approximately as
import numpy as np
import matplotlib.pyplot as plt
from lmfit import Model
def logfunc(T, a, b, c):
    return (a*np.log(b-T)) + c
log_model = Model(logfunc, nan_policy='raise') # raise error on NaNs
params = log_model.make_params(a=0.5, b=2.0, c=0.5) # initial values
params['b'].min = 1.8 # set min/max values
params['b'].max = 2.6
params['c'].min = 0.1 # and so forth
result = log_model.fit(np.log(Energy), params, T=T)
print(result.fit_report())
plt.plot(T, Energy, 'bo', label='data')
plt.plot(T, np.exp(result.best_fit), 'r--', label='fit')
plt.legend()
plt.xlabel('T')
plt.ylabel('Energy')
plt.gca().set_yscale('log', basey=10)
plt.show()
This is slightly more verbose than your starting script because it gives a labeled plot and because using Parameter objects instead of scalars gives more flexibility and clarity.
For your fit, you might consider setting the nan_policy to 'omit', which will omit NaNs as they occur -- never a great idea, but sometimes helpful to get you started on finding where log(b-T) is valid. You could also alter your model function to do something like
def logfunc(T, a, b, c):
    arg = b - T
    arg[np.where(arg < 1.e-16)] = 1.e-16
    return a*np.log(arg) + c
to explicitly prevent one obvious cause of NaNs.
Residuals are not finite in the initial point
means that the initial point is bad: some of the logarithms are infinite or undefined there. You need a better initial point.
By the nature of the model, b has to be greater than any of the points in T. The bounds on b that you have at present do not guarantee that. Tighten them up.
When you do not provide the p0 parameter, SciPy will take a guess within the provided bounds. So if the bounds guarantee finiteness, the error will not occur.
Still, it is generally better to prescribe p0 yourself, because you have better a priori understanding of the problem than SciPy does.
A working example with adjusted bounds:
popt, pcov=curve_fit(logfunc, np.linspace(0.3, 3.2, 6), [8, 7, 6, 5, 4, 3], bounds=([0.1, 3.2, 0.1], [1.0, 3.6, 1.0]))
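Following the advice above about prescribing p0 yourself, here is a sketch of the same toy example with an explicit initial guess; the values of a, b and c are assumptions chosen inside the bounds, with b above max(T) = 3.2 so that log(b - T) is finite at the starting point:

p0 = [0.5, 3.4, 0.5]
popt, pcov = curve_fit(logfunc, np.linspace(0.3, 3.2, 6), [8, 7, 6, 5, 4, 3],
                       p0=p0, bounds=([0.1, 3.2, 0.1], [1.0, 3.6, 1.0]))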