Plotting residuals of masked values with `statsmodels` - python-3.x

I'm using statsmodels.api to compute the statistical parameters for an OLS fit between two variables:
def computeStats(x, y, yName):
'''
Takes as an argument an array, and a string for the array name.
Uses Ordinary Least Squares to compute the statistical parameters for the
array against log(z), and determines the equation for the line of best fit.
Returns the results summary, residuals, statistical parameters in a list, and the
best fit equation.
'''
# Mask NaN values in both axes
mask = ~np.isnan(y) & ~np.isnan(x)
# Compute model parameters
model = sm.OLS(y, sm.add_constant(x), missing= 'drop')
results = model.fit()
residuals = results.resid
# Compute fit parameters
params = stats.linregress(x[mask], y[mask])
fit = params[0]*x + params[1]
fitEquation = '$(%s)=(%.4g \pm %.4g) \\times redshift+%.4g$'%(yName,
params[0], # slope
params[4], # stderr in slope
params[1]) # y-intercept
return results, residuals, params, fit, fitEquation
The second part of the function (using stats.linregress) plays nicely with the masked values, but statsmodels does not. When I try to plot the residuals against the x values with plt.scatter(x, resids), the dimensions do not match:
ValueError: x and y must be the same size
because there are 29007 x-values, and 11763 residuals (that's how many y-values made it through the masking process). I tried changing the model variable to
model = sm.OLS(y[mask], sm.add_constant(x[mask]), missing= 'drop')
but this had no effect.
How can I scatter-plot the residuals against the x-values they match with?

Hi #jim421616 Since statsmodels dropped few missing values, you should use the model's exog variable to plot the scatter as shown.
plt.scatter(model.model.exog[:,1], model.resid)
For reference a complete dummy example
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
#generate data
x = np.random.rand(1000)
y =np.sin( x*25)+0.1*np.random.rand(1000)
# Make some as NAN
y[np.random.choice(np.arange(1000), size=100)]= np.nan
x[np.random.choice(np.arange(1000), size=80)]= np.nan
# fit model
model = sm.OLS(y, sm.add_constant(x) ,missing='drop').fit()
print model.summary()
# plot
plt.scatter(model.model.exog[:,1], model.resid)
plt.show()

Related

Python Scipy Curvefit to Linear Quadratic Curve

I'm trying to fit a linear quadratic model curve to experiment data. The Y axis values reduce from 1 to 10^-5. When I use the following code, the resulting curve often seems to not fit the data at higher X values. I have a suspicion that because the Y values at high X values are so small, the resulting difference between the experiment value and model value is small. But I would like the model curve to pass as close to the higher X value points as possible (even if it means the low values are not as well fitted). I haven't found anything about weighting in scipy.optimize.curve_fit, other than using standard deviations (which I don't have). How can I improve my model fit at high X values?
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
def lq(x, a, b):
#y(x) = exp[-(ax+bx²)]
y = []
for i in x:
x2=i**2
ax = a*i
bx2 = b*x2
y.append(np.exp(-(ax+bx2)))
return y
#x and y are from experiment
x=[0,1.778,2.921,3.302,6.317,9.524,10.54]
y=[1,0.831763771,0.598411595,0.656145266,0.207014135,0.016218101,0.004102041]
(a,b), pcov = curve_fit(lq, x, y, p0=[0.05,0.05])
#make the model curve using a and b
xmodel = list(range(0,20))
ymodel = lq(xmodel, a, b)
fig, ax1 = plt.subplots()
ax1.set_yscale('log')
ax1.plot(x,y, "ro", label="Experiment")
ax1.plot(xmodel,ymodel, "r--", label="Model")
plt.show()
I agree with your assessment that the fit is not very sensitive to small misfits for the small values of y. Since you are plotting the data and fit on a semi-log plot, I think that what you really want is to fit in the log-space as well. That is, you could fit log(y) to a quadratic function. As an aside (but an important one if you're going to be doing numerical work with Python), you should not loop over lists but rather use numpy arrays: this will make everything faster and simpler. With such changes, your script might look like
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
def lq(x, a, b):
return -(a*x+b*x*x)
x = np.array([0,1.778,2.921,3.302,6.317,9.524,10.54])
y = np.array([1,0.831763771,0.598411595,0.656145266,0.207014135,0.016218101,0.004102041])
(a,b), pcov = curve_fit(lq, x, np.log(y), p0=[0.05,0.05])
xmodel = np.arange(20) # Note: use numpy!
ymodel = np.exp(lq(xmodel, a, b)) # Note: take exp() as inverse log()
fig, ax1 = plt.subplots()
ax1.set_yscale('log')
ax1.plot(x, y, "ro", label="Experiment")
ax1.plot(xmodel,ymodel, "r--", label="Model")
plt.show()
Note that the model function is changed to just be the ax+bx^2 you wanted to write in the first place and that this is now fitting np.log(y), not y. This will give a much more satisfying fit at the smaller y values.
You might also find lmfit (https://lmfit.github.io/lmfit-py/) helpful for this problem (disclaimer: I am a lead author). With this, your fit script could become
from lmfit import Model
model = Model(lq)
params = model.make_params(a=0.05, b=0.05)
result = model.fit(np.log(y), params, x=x)
print(result.fit_report())
xmodel = np.arange(20)
ymodel = np.exp(result.eval(x=xmodel))
plt.plot(x, y, "ro", label="Experiment")
plt.plot(xmodel, ymodel, "r--", label="Model")
plt.yscale('log')
plt.legend()
plt.show()
This will print out a report including fit statistics and interpretable uncertainties and correlations between variables:
[[Model]]
Model(lq)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 7
# data points = 7
# variables = 2
chi-square = 0.16149397
reduced chi-square = 0.03229879
Akaike info crit = -22.3843833
Bayesian info crit = -22.4925630
[[Variables]]
a: -0.05212688 +/- 0.04406602 (84.54%) (init = 0.05)
b: 0.05274458 +/- 0.00479056 (9.08%) (init = 0.05)
[[Correlations]] (unreported correlations are < 0.100)
C(a, b) = -0.968
and give a plot of
Note that lmfit Parameters can be fixed or bounded and that lmfit comes with many built-in models.
Finally, if you were to include a constant term in the quadratic model, you would not really need an iterative method but could use polynomial regression, as with numpy.polyfit.
Here is a graphical Python fitter using your data with a Gompertz type of sigmoidal equation. This code uses scipy's Differential Evolution genetic algorithm module to determine initial parameter estimates for scipy's non-linear curve_fit() routine. That scipy module uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, requiring bounds within which to search. In this example, I made all of the parameter search bounds from -2.0 to 2.0, and that seems to work in this case. Note that it is much easier to provide ranges for the initial parameter estimates than specific values, and those parameter ranges can be generous.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings
#x and y are from experiment
x=[0,1.778,2.921,3.302,6.317,9.524,10.54]
y=[1,0.831763771,0.598411595,0.656145266,0.207014135,0.016218101,0.004102041]
# alias data to match previous example code
xData = numpy.array(x, dtype=float)
yData = numpy.array(y, dtype=float)
def func(x, a, b, c): # Sigmoidal Gompertz C from zunzun.com
return a * numpy.exp(b * numpy.exp(c*x))
# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
warnings.filterwarnings("ignore") # do not print warnings by genetic algorithm
val = func(xData, *parameterTuple)
return numpy.sum((yData - val) ** 2.0)
def generate_Initial_Parameters():
parameterBounds = []
parameterBounds.append([-2.0, 2.0]) # search bounds for a
parameterBounds.append([-2.0, 2.0]) # search bounds for b
parameterBounds.append([-2.0, 2.0]) # search bounds for c
# "seed" the numpy random number generator for repeatable results
result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
return result.x
# by default, differential_evolution completes by calling curve_fit() using parameter bounds
geneticParameters = generate_Initial_Parameters()
# now call curve_fit without passing bounds from the genetic algorithm,
# just in case the best fit parameters are aoutside those bounds
fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Fitted parameters:', fittedParameters)
print()
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# plot wuth log Y axis scaling
plt.yscale('log')
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plot
xModel = numpy.linspace(min(xData), max(xData))
yModel = func(xModel, *fittedParameters)
# now the model as a line plot
axes.plot(xModel, yModel)
axes.set_xlabel('X Data') # X axis data label
axes.set_ylabel('Y Data') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)

How to interpolate 2D spatial data with kriging in Python?

I have a spatial 2D domain, say [0,1]×[0,1]. In this domain, there are 6 points where some scalar quantity of interest has been observed (e.g., temperature, mechanical stress, fluid density, etc.). How can I predict the quantity of interest at unobserved points? In other words, how may I interpolate spatial data in Python?
For example, consider the following coordinates for points in the 2D domain (inputs) and corresponding observations of the quantity of interest (outputs).
import numpy as np
coordinates = np.array([[0.0,0.0],[0.5,0.0],[1.0,0.0],[0.0,1.0],[0.5,1.],[1.0,1.0]])
observations = np.array([1.0,0.5,0.75,-1.0,0.0,1.0])
The X and Y coordinates can be extracted with:
x = coordinates[:,0]
y = coordinates[:,1]
The following script creates a scatter plot where yellow (resp. blue) represents high (resp. low) output values.
import matplotlib.pyplot as plt
fig = plt.figure()
plt.scatter(x, y, c=observations, cmap='viridis')
plt.colorbar()
plt.show()
I would like to use Kriging to predict the scalar quantity of interest on a regular grid within the 2D input domain. How can I do this in Python?
In OpenTURNS, the KrigingAlgorithm class can estimate the hyperparameters of a Gaussian process model based on the known output values at specific input points. The getMetamodel method of KrigingAlgorithm, then, returns a function which interpolates the data.
First, we need to convert the Numpy arrays coordinates and observations to OpenTURNS Sample objects:
import openturns as ot
input_train = ot.Sample(coordinates)
output_train = ot.Sample(observations, 1)
The array coordinates has shape (6, 2), so it is turned into a Sample of size 6 and dimension 2. The array observations has shape (6,), which is ambiguous: Is it going to be a Sample of size 6 and dimension 1, or a Sample of size 1 and dimension 6? To clarify this, we specify the dimension (1) in the call to the Sample class constructor.
In the following, we define a Gaussian process model with constant trend function and squared exponential covariance kernel:
inputDimension = 2
basis = ot.ConstantBasisFactory(inputDimension).build()
covariance_kernel = ot.SquaredExponential([1.0]*inputDimension, [1.0])
algo = ot.KrigingAlgorithm(input_train, output_train,
covariance_kernel, basis)
We then fit the value of the trend and the parameters of the covariance kernel (amplitude parameter and scale parameters) and obtain a metamodel:
# Fit
algo.run()
result = algo.getResult()
krigingMetamodel = result.getMetaModel()
The resulting krigingMetamodel is a Function which takes a 2D Point as input and returns a 1D Point. It predicts the quantity of interest. To illustrate this, let us build the 2D domain [0,1]×[0,1] and discretize it with a regular grid:
# Create the 2D domain
myInterval = ot.Interval([0.0, 0.0], [1.0, 1.0])
# Define the number of interval in each direction of the box
nx = 20
ny = 10
myIndices = [nx - 1, ny - 1]
myMesher = ot.IntervalMesher(myIndices)
myMeshBox = myMesher.build(myInterval)
Using our krigingMetamodel to predict the values taken by the quantity of interest on this mesh can be done with the following statements. We first get the vertices of the mesh as a Sample, and then evaluate the predictions with a single call to the metamodel (there is no need for a for loop here):
# Predict
vertices = myMeshBox.getVertices()
predictions = krigingMetamodel(vertices)
In order to see the result with Matplotlib, we first have to create the data required by the pcolor function:
# Format for plot
X = np.array(vertices[:, 0]).reshape((ny, nx))
Y = np.array(vertices[:, 1]).reshape((ny, nx))
predictions_array = np.array(predictions).reshape((ny,nx))
The following script produces the plot:
# Plot
import matplotlib.pyplot as plt
fig = plt.figure()
plt.pcolor(X, Y, predictions_array)
plt.colorbar()
plt.show()
We see that the predictions of the metamodel are equal to the observations at the observed input points.
This metamodel is a smooth function of the coordinates: its smoothness increases with covariance kernel smoothness and squared exponential covariance kernels happen to be smooth.

plot_decision_regions with error "Filler values must be provided when X has more than 2 training features."

I am plotting 2D plot for SVC Bernoulli output.
converted to vectors from Avg word2vec and standerdised data
split data to train and test.
Through grid search found the best C and gamma(rbf)
clf = SVC(C=100,gamma=0.0001)
clf.fit(X_train1,y_train)
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(X_train, y_train, clf=clf, legend=2)
plt.xlabel(X.columns[0], size=14)
plt.ylabel(X.columns[1], size=14)
plt.title('SVM Decision Region Boundary', size=16)
Receive error :-
ValueError: y must be a NumPy array. Found
also tried to convert the y to numpy. Then it prompts error
ValueError: y must be an integer array. Found object. Try passing the array as y.astype(np.integer)
finally i converted it to integer array.
Now it is prompting of error.
ValueError: Filler values must be provided when X has more than 2 training features.
You can use PCA to reduce your data multi-dimensional data to two dimensional data. Then pass the obtained result in plot_decision_region and there will be no need of filler values.
from sklearn.decomposition import PCA
from mlxtend.plotting import plot_decision_regions
clf = SVC(C=100,gamma=0.0001)
pca = PCA(n_components = 2)
X_train2 = pca.fit_transform(X_train)
clf.fit(X_train2, y_train)
plot_decision_regions(X_train2, y_train, clf=clf, legend=2)
plt.xlabel(X.columns[0], size=14)
plt.ylabel(X.columns[1], size=14)
plt.title('SVM Decision Region Boundary', size=16)
I've spent some time with this too as plot_decision_regions was then complaining ValueError: Column(s) [2] need to be accounted for in either feature_index or filler_feature_values and there's one more parameter needed to avoid this.
So, say, you have 4 features and they come unnamed:
X_train_std.shape[1] = 4
We can refer to each feature by their index 0, 1, 2, 3. You only can plot 2 features at a time, say you want 0 and 2.
You'll need to specify one additional parameter (to those specified in #sos.cott's answer), feature_index, and fill the rest with fillers:
value=1.5
width=0.75
fig = plot_decision_regions(X_train.values, y_train.values, clf=clf,
feature_index=[0,2], #these one will be plotted
filler_feature_values={1: value, 3:value}, #these will be ignored
filler_feature_ranges={1: width, 3: width})
You can just do (Assuming X_train and y_train are still panda dataframes) for the numpy array problem.
plot_decision_regions(X_train.values, y_train.values, clf=clf, legend=2)
For the filler_feature issue, you have to specify the number of features so you do the following:
value=1.5
width=0.75
fig = plot_decision_regions(X_train.values, y_train.values, clf=clf,
filler_feature_values={2: value, 3:value, 4:value},
filler_feature_ranges={2: width, 3: width, 4:width},
legend=2, ax=ax)
You need to add one filler feature for each feature you have.

How to normalize time series data with multiple features by using sklearn?

For data with the shape (num_samples,features), MinMaxScaler from sklearn.preprocessing can be used to normalize it easily.
However, when using the same method for time series data with the shape (num_samples, time_steps,features), sklearn will give an error.
from sklearn.preprocessing import MinMaxScaler
import numpy as np
#Making artifical time data
x1 = np.linspace(0,3,4).reshape(-1,1)
x2 = np.linspace(10,13,4).reshape(-1,1)
X1 = np.concatenate((x1*0.1,x2*0.1),axis=1)
X2 = np.concatenate((x1,x2),axis=1)
X = np.stack((X1,X2))
#Trying to normalize
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X) <--- error here
ValueError: Found array with dim 3. MinMaxScaler expected <= 2.
This post suggests something like
(timeseries-timeseries.min())/(timeseries.max()-timeseries.min())
Yet, it only works for data with only 1 feature. Since my data has more than 1 feature, this method doesn't work.
How to normalize time series data with multiple features?
To normalize a 3D tensor of shape (n_samples, timesteps, n_features) use the following:
(timeseries-timeseries.min(axis=2))/(timeseries.max(axis=2)-timeseries.min(axis=2))
Using the argument axis=2 will return the result of the tensor operation performed along the 3rd dimension i.e., the feature axis. Thus each feature will be normalized independently.

sklearn - label points of PCA

I am generating a PCA which uses scikitlearn, numpy and matplotlib. I want to know how to label each point (row in my data). I found "annotate" in matplotlib, but this seems to be for labeling specific coordinates, or just putting text on arbitrary points by the order they appear. I'm trying to abstract away from this but struggling due to the PCA sections that appear before the matplot stuff. Is there a way I can do this with sklearn, while I'm still generating the plot, so I don't lose its connection to the row I got it from?
Here's my code:
# Create a Randomized PCA model that takes two components
randomized_pca = decomposition.RandomizedPCA(n_components=2)
# Fit and transform the data to the model
reduced_data_rpca = randomized_pca.fit_transform(x)
# Create a regular PCA model
pca = decomposition.PCA(n_components=2)
# Fit and transform the data to the model
reduced_data_pca = pca.fit_transform(x)
# Inspect the shape
reduced_data_pca.shape
# Print out the data
print(reduced_data_rpca)
print(reduced_data_pca)
def rand_jitter(arr):
stdev = .01*(max(arr)-min(arr))
return arr + np.random.randn(len(arr)) * stdev
colors = ['red', 'blue']
for i in range(len(colors)):
w = reduced_data_pca[:, 0][y == i]
z = reduced_data_pca[:, 1][y == i]
plt.scatter(w, z, c=colors[i])
targ_names = ["Negative", "Positive"]
plt.legend(targ_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title("PCA Scatter Plot")
plt.show()
PCA is a projection, not a clustering (you tagged this as clustering).
There is no concept of a label in PCA.
You can draw texts onto a scatterplot, but usually it becomes too crowded. You can find answers to this on stackoverflow already.

Resources