Modifying old GaussianProcess example to run with GaussianProcessRegressor

I have an example from a data science book that I am trying to run in a Jupyter notebook. The code snippet looks like this:
import numpy as np
from sklearn.gaussian_process import GaussianProcess

# define the model and draw some data
model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)

# Compute the Gaussian process fit
gp = GaussianProcess(corr='cubic', theta0=1e-2, thetaL=1e-4, thetaU=1e-1,
                     random_start=100)
gp.fit(xdata[:, np.newaxis], ydata)

xfit = np.linspace(0, 10, 1000)
yfit, MSE = gp.predict(xfit[:, np.newaxis], eval_MSE=True)
dyfit = 2 * np.sqrt(MSE)  # 2*sigma ~ 95% confidence region
Since GaussianProcess has been deprecated and replaced with GaussianProcessRegressor, I tried to update the snippet to look like this:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# define the model and draw some data
model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)

# Compute the Gaussian process fit
gp = GaussianProcessRegressor(random_state=100)
gp.fit(xdata[:, np.newaxis], ydata)

xfit = np.linspace(0, 10, 1000)
yfit, MSE = gp.predict(xfit[:, np.newaxis])
dyfit = 2 * np.sqrt(MSE)  # 2*sigma ~ 95% confidence region
but I get a ValueError
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-47-c04ac57d1897> in <module>
11
12 xfit = np.linspace(0, 10, 1000)
---> 13 yfit, MSE = gp.predict(xfit[:, np.newaxis])
14 dyfit = 2 * np.sqrt(MSE) # 2*sigma ~ 95% confidence region
ValueError: too many values to unpack (expected 2)
I'm a bit unsure why the predict function complains here.

The error has the answer.
At yfit, MSE = gp.predict(xfit[:, np.newaxis]) you are trying to unpack the result of predict into two variables, while by default predict returns only a single numpy.ndarray.
To solve this issue, run
yfit = gp.predict(xfit[:, np.newaxis])
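To get the uncertainty band from the original example back, GaussianProcessRegressor.predict accepts return_std=True, which returns the standard deviation of the predictive distribution alongside the mean. This replaces the old eval_MSE=True; note that it returns a standard deviation, not a variance, so no square root is needed. A minimal sketch:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)

gp = GaussianProcessRegressor(random_state=100)
gp.fit(xdata[:, np.newaxis], ydata)

xfit = np.linspace(0, 10, 1000)
yfit, std = gp.predict(xfit[:, np.newaxis], return_std=True)
dyfit = 2 * std  # 2*sigma ~ 95% confidence region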

Related

PyMC3 combining a deterministic with gaussian process

I'm trying to run a forecast on data that appears to have two components: a curve that looks like exponential decay with a seasonal overlay on top. The attached image shows a sample of the simulated data.
What I've done so far is the following:
I can do a Gaussian process regression using the model below. It tries to find a switchpoint and fits a linear and a seasonal component before and after that point. This would have worked by itself with the available data, but in my real data I suspect there is an exponential-like trend.
switchpoint = pm.DiscreteUniform("switchpoint", lower=40, upper=60, testval=50)
ls_1 = pm.Gamma(name="ls_1", alpha=1.0, beta=0.5)
ls_2 = pm.Gamma(name="ls_2", alpha=1.0, beta=0.5)
period_1 = pm.Gamma(name="period_1", alpha=12, beta=2)
period_2 = pm.Gamma(name="period_2", alpha=12, beta=2)
ls_switched = pm.math.switch(switchpoint < x_switch, ls_1, ls_2)
period_switched = pm.math.switch(switchpoint < x_switch, period_1, period_2)
gp_1 = pm.gp.Marginal(
    cov_func=pm.gp.cov.Periodic(input_dim=1, period=period_switched, ls=ls_switched)
)
# Linear trend.
c_31 = pm.Normal(name="c_31", mu=0, sigma=2)
c_32 = pm.Normal(name="c_32", mu=0, sigma=2)
c_switched = pm.math.switch(switchpoint < x_switch, c_31, c_32)
gp_3 = pm.gp.Marginal(cov_func=pm.gp.cov.Linear(1, c=c_switched))
# Define gaussian process.
gp = gp_1 + gp_3
# Noise.
sigma = pm.HalfNormal("sigma", sigma=2)
# Likelihood.
y_pred = gp.marginal_likelihood(
    "y_pred",
    X=x_train.reshape(n_train, 1),
    y=y_train.reshape(n_train, 1).flatten(),
    noise=sigma,
)
I want to overlay that with the following model, which fits an exponential decay curve to the data:
amp = pm.Uniform("amp", 0.05, 0.4)
size = pm.Uniform("size", 0.5, 2.5)
ps = pm.Normal("ps", 0.13, 40)
x_pred = np.linspace(0, 70, 1)
z = pm.Deterministic(
    "z",
    amp
    * np.exp(
        -1
        * (np.pi**2 * size * x_pred / (3600.0 * 180.0)) ** 2
        / (4.0 * np.log(2.0))
    )
    + ps,
)
y = pm.Normal("y", mu=z + gp, tau=1.0, observed=y_act)
Basically my generative process is something that decays like an exponential function but has seasonality overlaid on it. This is the point where I'm stuck: how do I tell pymc3 that I want to sample from an overlay of the two processes?
It gives the following error
Traceback (most recent call last):
  File "./t.py", line 60, in <module>
    y = pm.Normal("y", mu=z + gp, tau=1.0, observed=y_act)
TypeError: unsupported operand type(s) for +: 'DeterministicWrapper' and 'Marginal'
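The thread records no answer, but the error itself narrows things down: pm.gp.Marginal is a GP wrapper object, not a tensor, so it cannot appear in mu=z + gp. One possible direction, offered here as a sketch rather than a verified fix, is pm.gp.Latent, whose prior method returns a random-variable tensor that can be added to z. The covariance setup is abbreviated and reuses the names from the question:

# Sketch: a Latent GP exposes the function values f as a tensor.
cov = pm.gp.cov.Periodic(input_dim=1, period=period_switched, ls=ls_switched)
gp_seasonal = pm.gp.Latent(cov_func=cov)
f = gp_seasonal.prior("f", X=x_train.reshape(-1, 1))  # a tensor, unlike Marginal
sigma = pm.HalfNormal("sigma", sigma=2)
y = pm.Normal("y", mu=z + f, sigma=sigma, observed=y_act)

Latent GPs are sampled with MCMC rather than marginalized analytically, so this trades speed for the flexibility of combining the GP draw with arbitrary deterministic terms.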

numpy condition function for 2-D data

I have a synthetic dataset consisting of features (X) and labels (y), which I use for KMeans clustering with Python 3.8, sklearn 0.22.2 and numpy 1.19.
X.shape, y.shape
# ((100, 2), (100,))
kmeans = KMeans(n_clusters = 3, init = 'random', n_init = 10, max_iter = 300)
# Train model on scaled features-
kmeans.fit(X)
After training KMeans on 'X', I want to replace the unique (continuous) values of 'X' with the (discrete) cluster centers obtained using KMeans.
for i in range(3):
    print("cluster number {0} has center = {1}".format(i + 1, kmeans.cluster_centers_[i, :]))
'''
cluster number 1 has center = [-0.7869159 1.14173859]
cluster number 2 has center = [ 1.28010442 -1.04663318]
cluster number 3 has center = [-0.54654735 0.0054752 ]
'''
set(kmeans.labels_)
# {0, 1, 2}
One way I have of doing it is:
X[np.where(clustered_labels == 0)] = val[0,:]
X[np.where(clustered_labels == 1)] = val[1,:]
X[np.where(clustered_labels == 2)] = val[2,:]
Can I do it using np.select()?
cond = [clustered_labels == i for i in range(3)]
val = kmeans.cluster_centers_[:,:]
But on executing the code:
np.select(cond, val)
I get the following error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 np.select(cond, val)

<__array_function__ internals> in select(*args, **kwargs)

~/.local/lib/python3.8/site-packages/numpy/lib/function_base.py in select(condlist, choicelist, default)
    693         result_shape = condlist[0].shape
    694     else:
--> 695         result_shape = np.broadcast_arrays(condlist[0], choicelist[0])[0].shape
    696
    697     result = np.full(result_shape, choicelist[-1], dtype)

<__array_function__ internals> in broadcast_arrays(*args, **kwargs)

~/.local/lib/python3.8/site-packages/numpy/lib/stride_tricks.py in broadcast_arrays(subok, *args)
    256     args = [np.array(_m, copy=False, subok=subok) for _m in args]
    257
--> 258     shape = _broadcast_shape(*args)
    259
    260     if all(array.shape == shape for array in args):

~/.local/lib/python3.8/site-packages/numpy/lib/stride_tricks.py in _broadcast_shape(*args)
    187     # use the old-iterator because np.nditer does not handle size 0 arrays
    188     # consistently
--> 189     b = np.broadcast(*args[:32])
    190     # unfortunately, it cannot handle 32 or more arguments directly
    191     for pos in range(32, len(args), 31):

ValueError: shape mismatch: objects cannot be broadcast to a single shape
Suggestions?
Thanks!
A somewhat cleaner way to do it (though very similar to yours) is the following. Here's a simple example:
from sklearn.cluster import KMeans
import numpy as np
x1 = np.random.normal(0, 2, 100)
y1 = np.random.normal(0, 1, 100)
label1 = np.ones(100)
d1 = np.column_stack([x1, y1, label1])
x2 = np.random.normal(3, 1, 100)
y2 = np.random.normal(1, 2, 100)
label2 = np.ones(100) * 2
d2 = np.column_stack([x2, y2, label2])
x3 = np.random.normal(-3, 0.5, 100)
y3 = np.random.normal(0.5, 0.25, 100)
label3 = np.ones(100) * 3
d3 = np.column_stack([x3, y3, label3])
D = np.row_stack([d1, d2, d3])
np.random.shuffle(D)
X = D[:, :2]
y = D[:, 2]
print(f'X.shape = {X.shape}, y.shape = {y.shape}')
# X.shape = (300, 2), y.shape = (300,)
kmeans = KMeans(n_clusters = 3, init = 'random', n_init = 10, max_iter = 300)
# Train model on scaled features-
kmeans.fit(X)
preds = kmeans.predict(X)
X[preds==0] = kmeans.cluster_centers_[0]
X[preds==1] = kmeans.cluster_centers_[1]
X[preds==2] = kmeans.cluster_centers_[2]
Yet another way to accomplish the task, in a single step instead of three masked assignments, is to index the cluster centers directly with the predicted labels (np.put would not work here, since it interprets a boolean mask as flat integer indices):
X = kmeans.cluster_centers_[preds]
Frankly, I don't see a way to accomplish the task by means of the np.select function, and I guess the way you do it is the best way, based on this answer.
Cheers.
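That said, one possible workaround (a sketch of my own, not from the original answer): np.select broadcasts each condition against the corresponding choice, so the shape mismatch disappears if each 1-D label mask is given a trailing axis that lets it broadcast against the 2-column centers:

# Each (n,) mask becomes (n, 1) and broadcasts with each (2,) center to (n, 2).
cond = [(kmeans.labels_ == i)[:, None] for i in range(3)]
X_discrete = np.select(cond, kmeans.cluster_centers_)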

Poor GMM fit in sklearn from 2 Gaussians

I want to fit a 2-component mixture model with sklearn and then calculate posterior probabilities from it. But with the code I have so far, the fit for one of the two distributions is perfect (overfitting?) while the other one is very poor. I made a dummy example by sampling from 2 Gaussians:
import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

def calc_pdf():
    """
    calculate gauss mixture modelling for 2 comp
    return pdfs
    """
    d = np.random.normal(-0.1, 0.07, 5000)
    t = np.random.normal(0.2, 0.13, 10000)
    pool = np.concatenate([d, t]).reshape(-1, 1)
    label = ['d'] * d.shape[0] + ['t'] * t.shape[0]
    X = pool[pool > 0].reshape(-1, 1)
    X = np.log(X)
    clf = GaussianMixture(
        n_components=2,
        covariance_type='full',
        tol=1e-24,
        max_iter=1000
    )
    logprob = clf.fit(X).score_samples(X)
    responsibilities = clf.predict_proba(X)
    pdf = np.exp(logprob)
    pdf_individual = responsibilities * pdf[:, np.newaxis]
    plot_gauss(np.log(d), np.log(t), pdf_individual, X)
    return pdf_individual[0], pdf_individual[1]

def plot_gauss(d, t, pdf_individual, x):
    fig, ax = plt.subplots(figsize=(12, 9), facecolor='white')
    ax.hist(d, 30, density=True, histtype='stepfilled', alpha=0.4)
    ax.hist(t, 30, density=True, histtype='stepfilled', alpha=0.4)
    ax.plot(x, pdf_individual, '.')
    ax.set_xlabel('$x$')
    ax.set_ylabel('$p(x)$')
    plt.show()

calc_pdf()
which produces this plot:
Is there something obvious that I am missing?
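No answer is recorded in the thread, but one thing worth checking (my own observation from the code, not from the original discussion): pool[pool > 0] drops every non-positive sample before the log transform, and since d is drawn from N(-0.1, 0.07), that filter discards over 90% of the first component, so the mixture barely sees it. A minimal sketch that fits the mixture on the raw pooled data instead:

import numpy as np
from sklearn.mixture import GaussianMixture

d = np.random.normal(-0.1, 0.07, 5000)
t = np.random.normal(0.2, 0.13, 10000)
X = np.concatenate([d, t]).reshape(-1, 1)  # keep all samples, no filtering or log

clf = GaussianMixture(n_components=2, covariance_type='full', max_iter=1000).fit(X)
print(clf.means_.ravel())                 # expected near [-0.1, 0.2]
print(np.sqrt(clf.covariances_).ravel())  # expected near [0.07, 0.13]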

TypeError: 'type' object is not subscriptable during clustering

I am implementing the KMeans algorithm for clustering, and I get this error when running it in Jupyter. I am applying the elbow method to find the optimal number of clusters.
# Now find the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss = []
for i in range[1,11]:
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1,11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-31-ebfededa579e> in <module>()
      2 from sklearn.cluster import KMeans
      3 wcss = []
----> 4 for i in range[1,11]:
      5     kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
      6     kmeans.fit(X)
TypeError: 'type' object is not subscriptable
The error says (or tries to say) that range is a type, not something you can index with square brackets. You need to call it: range(1, 11) instead of range[1, 11].
If you change this in the 4th line, it should work (at least this part).
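For reference, the corrected loop (same parameters as in the question):

wcss = []
for i in range(1, 11):  # parentheses: range is called, not subscripted
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)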

Issue when trying to plot after applying PCA on a dataset

I am trying to plot the results of PCA on the pima-indians-diabetes.csv dataset. My code fails only in the plotting part:
import numpy
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
# Dataset Description:
# 1. Number of times pregnant
# 2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
# 3. Diastolic blood pressure (mm Hg)
# 4. Triceps skin fold thickness (mm)
# 5. 2-Hour serum insulin (mu U/ml)
# 6. Body mass index (weight in kg/(height in m)^2)
# 7. Diabetes pedigree function
# 8. Age (years)
# 9. Class variable (0 or 1)
path = 'pima-indians-diabetes.data.csv'
dataset = numpy.loadtxt(path, delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]
features = ['1','2','3','4','5','6','7','8','9']
df = pd.read_csv(path, names=features)
x = df.loc[:, features].values # Separating out the values
y = df.loc[:,['9']].values # Separating out the target
x = StandardScaler().fit_transform(x) # Standardizing the features
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
# principalDf = pd.DataFrame(data=principalComponents, columns=['pca1', 'pca2'])
# finalDf = pd.concat([principalDf, df[['9']]], axis = 1)
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
for color, i, target_name in zip(colors, [0, 1, 2], ['Negative', 'Positive']):
    plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1],
                color=color, alpha=.8, lw=lw, label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of pima-indians-diabetes Dataset')
The error is located at the following line:
Traceback (most recent call last):
  File "test.py", line 53, in <module>
    plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1], color=color, alpha=.8, lw=lw,
IndexError: too many indices for array
Kindly, how can I fix this?
As the error indicates some kind of shape/dimension mismatch, a good starting point is to check the shapes of the arrays involved in the operation:
principalComponents.shape
yields
(768, 2)
while
(y==i).shape
yields
(768, 1)
This leads to a shape mismatch when trying to run
principalComponents[y==i, 0]
because the boolean mask is itself two-dimensional, so the error indicates that you used too many indices for the array.
You can fix this by forcing the shape of y==i to a 1D array ((768,)), e.g. by changing your call to scatter to
plt.scatter(principalComponents[(y == i).reshape(-1), 0],
            principalComponents[(y == i).reshape(-1), 1],
            color=color, alpha=.8, lw=lw, label=target_name)
which then creates the plot for me
For more information on the difference between arrays of shape (R, 1) and (R,), this question on StackOverflow provides a nice starting point.
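Equivalently (my suggestion, not from the original answer), you can flatten y once up front so every later mask is already 1-D; selecting the column as a Series rather than a one-column frame does this directly:

y = df.loc[:, '9'].values  # selecting a single column gives shape (768,), not (768, 1)
# or, starting from the existing 2-D y:
y = y.ravel()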
