I want to use Gaussian Mixture models to find the centers of multimodal distributions that look something like this:
To this end I want to use sklearn.mixture.GaussianMixture, which fits a mixture of Gaussian distributions to data. It is usually used like this:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from sklearn import mixture
n_samples = 300
# generate random sample, two components
np.random.seed(0)
# generate spherical data centered on (20, 20)
shifted_gaussian = np.random.randn(n_samples, 2) + np.array([20, 20])
# generate zero centered stretched Gaussian data
C = np.array([[0., -0.7], [3.5, .7]])
stretched_gaussian = np.dot(np.random.randn(n_samples, 2), C)
# concatenate the two datasets into the final training set
X_train = np.vstack([shifted_gaussian, stretched_gaussian])
# fit a Gaussian Mixture Model with two components
clf = mixture.GaussianMixture(n_components=2, covariance_type='full')
clf.fit(X_train)
The point is that the data is given as a list of 2D points that form a Gaussian cloud. My data is a little different: more like weighted x,y points. Given my image, I could do something like this:
import numpy as np
import cv2
image = cv2.imread("double_blob.jpg")
xs, ys = np.meshgrid(list(range(image.shape[0])), list(range(image.shape[1])))
xs, ys = xs.flatten(), ys.flatten()
weights = image[xs, ys].flatten()
to get a list of x,y image coordinates and weights. But I don't know how I can feed this to the GaussianMixture function. Any ideas?
I have found a 'cheat' way of doing it:
import numpy as np
import cv2 as cv
from sklearn.mixture import GaussianMixture
data = cv.imread("dual_blob.jpg")
data = cv.normalize(data, None, 0, 255, cv.NORM_MINMAX)
gmm = GaussianMixture(n_components=2)
xs, ys = np.meshgrid(list(range(glint_size*2)), list(range(glint_size*2)))
xs, ys = xs.flatten(), ys.flatten()
gmm_data = [
np.array([[x, y]] * int(data[x, y])).transpose()
if int(data[x, y]) > 0
else -np.ones((2, 1))
for x, y in zip(xs, ys)
]
gmm_data = np.concatenate(gmm_data, axis=1)
gmm_data = gmm_data[gmm_data >= 0]
gmm_data = gmm_data.reshape(2, gmm_data.shape[0] // 2).transpose()
print(gmm_data)
gmm.fit(gmm_data)
centers = gmm.means_
Basically what it does is normalise the image to between 0 and 255. Then it goes over every pixel and creates as many points of that coordinate as the image value at that pixel. So if the pixel at [3, 7] has a value of 10, then [[3, 7], [3, 7],[3, 7],[3, 7],[3, 7],[3, 7],[3, 7],[3, 7],[3, 7],[3, 7]] gets added to the list of points for processing. This gives:
However this solution is so ugly. So I'm definitely keen to see if anyone has something better.
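For what it's worth, the same repetition trick can probably be written without the Python loop. This is only a sketch: weighted_pixels_to_points is a made-up helper name, and the tiny demo array stands in for the normalised image. The resulting (n, 2) array is what would be passed to gmm.fit.

```python
import numpy as np

def weighted_pixels_to_points(image):
    """Repeat each (row, col) coordinate as many times as its integer pixel value."""
    image = np.asarray(image)
    mask = image > 0
    coords = np.argwhere(mask)           # (k, 2) coordinates in row-major order
    counts = image[mask].astype(int)     # pixel values in the same row-major order
    return np.repeat(coords, counts, axis=0)

# Tiny synthetic "image" standing in for the normalised blob picture
demo = np.array([[0, 2, 0],
                 [1, 0, 3]])
points = weighted_pixels_to_points(demo)
print(points)   # one row per repeated coordinate, shape (6, 2)
```

It is still the same 'cheat' (memory grows with the sum of the pixel values), just with np.argwhere and np.repeat doing the bookkeeping.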
Related
I would like to plot points with 100 parameters each with values between 0-99 on a 2 dimensional plot. This should be straightforward with normal methods of dimensionality reduction (PCA/tSNE/UMAP etc) but I need to be able to add subsequent points to the plot without the projection having to be recalculated and therefore changing.
I am picturing an algorithm that takes a data-point with its 100 values and converts it to X,Y coordinates that can then be plotted. Points proximal in the 2D projection are proximal in the original 100D space. Does such an algorithm exist? If not, any alternative approaches?
Thanks
I am not sure I understood the question correctly but with an initial set X, we can fit a PCA to compute the principal components. Then, we can use these principal components to transform new samples.
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
n_samples, n_feats = 50, 100
X = np.random.randint(0, 100, size=n_samples * n_feats).reshape(n_samples, n_feats)  # upper bound is exclusive, so values are 0-99
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1])  # plot the 2D projection, not the first two raw features
This plots,
Then, when a new sample comes in
new_sample = np.random.randint(0, 100, size=100).reshape(1, 100)
new_sample_reduced = pca.transform(new_sample)
plt.scatter(new_sample_reduced[:, 0], new_sample_reduced[:, 1], color="red")
We can plot it
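To see why existing points stay put, here is a NumPy-only sketch of what a fitted (non-whitened) PCA does when transforming: it subtracts the stored mean and multiplies by the stored principal directions. The names project, mean and components are mine, and the data is random, as in the snippet above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(50, 100)).astype(float)   # 50 points, 100 values each

# "Fit" once: remember the mean and the top two principal directions
mean = X.mean(axis=0)
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)
components = vt[:2]                                       # shape (2, 100)

def project(samples):
    """Map 100-D rows to 2-D using the stored mean and directions."""
    return (np.atleast_2d(samples) - mean) @ components.T

X_2d = project(X)
new_point = rng.integers(0, 100, size=100)
new_2d = project(new_point)                               # shape (1, 2)

# Nothing was refitted, so the earlier points project to the same place
print(np.allclose(X_2d, project(X)), new_2d.shape)
```

The mapping is a fixed linear function after fitting, so new points can be added at any time. The trade-off is that the directions reflect only the initial set.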
I am trying to generate random points uniformly distributed over rectangular region centered at (0,0) using numpy and matplotlib.
The phrase "random points uniformly distributed" is not easy to interpret. Here is one of my interpretations, shown as runnable code. A sample output plot is also given.
import matplotlib.pyplot as plt
import numpy as np
# create array of meshgrid over a rectangular region
# range of x: -cn/2, cn/2
# range of y: -rn/2, rn/2
cn, rn = 10, 14 # number of columns/rows
xs = np.linspace(-cn/2, cn/2, cn)
ys = np.linspace(-rn/2, rn/2, rn)
# meshgrid will give regular array-like located points
Xs, Ys = np.meshgrid(xs, ys) #shape: rn x cn
# create some uncertainties to add as random effects to the meshgrid
mean = (0, 0)
varx, vary = 0.007, 0.008 # adjust these number to suit your need
cov = [[varx, 0], [0, vary]]
uncerts = np.random.multivariate_normal(mean, cov, (rn, cn))
# plot the random-like meshgrid
plt.scatter(Xs+uncerts[:,:,0], Ys+uncerts[:,:,1], color='b');
plt.gca().set_aspect('equal')
plt.show()
You can change the values of varx and vary to change the level of randomness of the dot array on the plot.
As @JohanC mentioned in the comments, you need points with x and y coordinates between -20 and 20. To create them, use:
a = np.random.uniform(-20, 20, size=(n, 2))
with n being your desired number of points.
To plot them:
import matplotlib.pyplot as plt
plt.scatter(a[:,0],a[:,1])
sample plot for n=100 points:
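If the rectangle is not square, the two coordinates can be drawn separately. A small sketch, where uniform_points_in_rectangle, width and height are names I made up for illustration:

```python
import numpy as np

def uniform_points_in_rectangle(n, width, height, seed=None):
    """Draw n points uniformly from a width x height rectangle centered at (0, 0)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-width / 2, width / 2, size=n)
    y = rng.uniform(-height / 2, height / 2, size=n)
    return np.column_stack([x, y])       # shape (n, 2)

pts = uniform_points_in_rectangle(1000, width=10, height=14, seed=0)
print(pts.shape, pts.min(axis=0), pts.max(axis=0))
```

The result can be plotted with plt.scatter(pts[:, 0], pts[:, 1]) exactly as above.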
I am trying to get contourf to plot my stuff right, but it seems to switch the x and y coordinates. In the example below, I show this by evaluating a 2d Gaussian function that has different widths in x and y directions. With the values given, the width in y direction should be larger. Here is the script:
from numpy import *
from matplotlib.pyplot import *
xMax = 50
xNum = 100
w0x = 10
w0y = 15
dx = xMax/xNum
xGrid = linspace(-xMax/2+dx/2, xMax/2-dx/2, xNum, endpoint=True)
yGrid = xGrid
Int = zeros((xNum, xNum))
for idX in range(xNum):
    for idY in range(xNum):
        Int[idX, idY] = exp(-((xGrid[idX]/w0x)**2 + (yGrid[idY]/(w0y))**2))
fig = figure(6)
clf()
ax = subplot(2,1,1)
X, Y = meshgrid(xGrid, yGrid)
contour(X, Y, Int, colors='k')
plot(array([-xMax, xMax])/2, array([0, 0]), '-b')
plot(array([0, 0]), array([-xMax, xMax])/2, '-r')
ax.set_aspect('equal')
xlabel("x")
ylabel("y")
subplot(2,1,2)
plot(xGrid, Int[:, int(xNum/2)], '-b', label='I(x, y=max/2)')
plot(xGrid, Int[int(xNum/2), :], '-r', label='I(x=max/2, y)')
ax.set_aspect('equal')
legend()
xlabel(r"x or y")
ylabel(r"I(x or y)")
The figure thrown out is this:
On top is the contour plot, which has the larger width in the x direction (not y). Below, two slices are shown: one across the x direction (at constant y=0, blue), the other in the y direction (at constant x=0, red). Here, everything seems fine: the y direction is broader than the x direction. So why would I have to transpose the array in order to have it plotted as I want? This seems unintuitive to me and not in agreement with the documentation.
It helps if you think of a 2D array's shape not as (x, y) but as (rows, columns), because that is how most math routines interpret them - including matplotlib's 2D plotting functions. Therefore, the first dimension is vertical (which you call y) and the second dimension is horizontal (which you call x).
Note that this convention is very prominent, even in numpy. The function np.vstack, which is supposed to concatenate arrays vertically, works along the first dimension, and np.hstack works horizontally along the second dimension.
To illustrate the point:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([[0, 0, 1, 0, 0],
[0, 1, 1, 1, 0],
[1, 1, 1, 1, 1]])
a[:, 2] = 2 # set column
print(a)
plt.imshow(a)
plt.contour(a, colors='k')
This prints
[[0 0 2 0 0]
[0 1 2 1 0]
[1 1 2 1 1]]
and consistently plots
According to your convention that an array is (x, y) the command a[:, 2] = 2 should have assigned to the third row, but numpy and matplotlib both agree that it was the column :)
You can of course use your own convention how to interpret the dimensions of your arrays, but in the long run it will be more consistent to treat them as (y, x).
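Applied to the Gaussian from the question, a sketch of the (y, x) convention might look like this. The grid sizes are deliberately different (my choice) so the axis order shows up in the shape. Building the array from the meshgrid outputs gives it the layout that contour expects, whereas the loop-filled array indexed [ix, iy] is its transpose.

```python
import numpy as np

w0x, w0y = 10, 15
x = np.linspace(-25, 25, 101)    # 101 samples along x
y = np.linspace(-25, 25, 51)     # 51 samples along y

X, Y = np.meshgrid(x, y)         # default indexing='xy': shape (len(y), len(x))
Z = np.exp(-((X / w0x) ** 2 + (Y / w0y) ** 2))
print(Z.shape)                   # (51, 101): rows follow y, columns follow x

# Filling the array as [ix, iy], as in the question, yields the transpose
Z_loop = np.empty((x.size, y.size))
for ix in range(x.size):
    for iy in range(y.size):
        Z_loop[ix, iy] = np.exp(-((x[ix] / w0x) ** 2 + (y[iy] / w0y) ** 2))
print(np.allclose(Z, Z_loop.T))  # True
```

So either build the array from X and Y directly, or pass Int.T to contour.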
I have a data set "x" and its label vector "y". I want to plot the accuracy for each attribute (for each column of "x") after applying NaiveBayes and cross-validation. I want a bar graph.
So at the end I need to have 3 bars, because "x" has 3 columns. And the classification has to run 3 times. 3 different accuracies for each feature.
Whenever I execute my code it shows:
ValueError: Found arrays with inconsistent numbers of samples: [1 3]
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
What am I doing wrong?
import matplotlib.pyplot as plt
import numpy as np
from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
x = np.array([[0, 0.51, 0.00101], [3, 0.54, 0.00105], [6, 0.57, 0.00108], [9, 0.60, 0.00111], [1, 0.73, 0.00114], [5, 0.76, 0.00117], [8, 0.89, 120]])
y = np.array([1, 0, 0, 1, 1, 1, 0])
scores = list()
scores_std = list()
for i in range(x.shape[1]):
    xA=x[:, i]
    scoresKF2 = cross_validation.cross_val_score(clf, xA, y, cv=2)
    scores.append(np.mean(scoresKF2))
    scores_std.append(np.std(scoresKF2))
    plt.bar(x[:,i], scores)
plt.show()
Checking the shape of your input data, xA, shows that it is 1-dimensional: specifically, its shape is (7,). As the warning tells us, you are not allowed to pass in a 1d array here. The key to solving this is in the warning that was returned: "Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample." Since xA is a single feature, use xA = x[:,i].reshape(-1, 1) instead of xA = x[:,i].
I think there is another issue with the plotting. I'm not completely sure what you are expecting to see but you should probably replace plt.bar(x[:,i], scores) with plt.bar(i, np.mean(scoresKF2)).
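To make the shape difference concrete, here is a small NumPy-only sketch using a shortened version of your array. The variable names are mine.

```python
import numpy as np

x = np.array([[0, 0.51], [3, 0.54], [6, 0.57]])

col_1d = x[:, 0]                  # shape (3,): a flat vector, no feature axis
col_2d = x[:, 0].reshape(-1, 1)   # shape (3, 1): 3 samples, 1 feature
col_kept = x[:, [0]]              # indexing with a list keeps the column axis: (3, 1)

print(col_1d.shape, col_2d.shape, col_kept.shape)   # (3,) (3, 1) (3, 1)
```

Either of the two (3, 1) forms is the (n_samples, n_features) layout the estimator expects.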
I'm having some trouble understanding scikit-learn's LogisticRegression() method. Here's a simple example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
# Create a sample dataframe
data = [['Age', 'ZepplinFan'], [13, 0], [25, 0], [40, 1], [51, 0], [55, 1], [58, 1]]
columns=data.pop(0)
df = pd.DataFrame(data=data, columns=columns)
# df now looks like this:
#    Age  ZepplinFan
# 0   13           0
# 1   25           0
# 2   40           1
# 3   51           0
# 4   55           1
# 5   58           1
# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(X=df[['Age']], y = df['ZepplinFan'])
# View the coefficients
lr.intercept_ # returns -0.56333276
lr.coef_ # returns 0.02368826
# Predict for new values
xvals = np.arange(-10,70,1)
predictions = lr.predict_proba(X=xvals[:,np.newaxis])
probs = [y for [x, y] in predictions]
# Plot the fitted model
plt.plot(xvals, probs)
plt.scatter(df.Age.values, df.ZepplinFan.values)
plt.show()
Obviously this doesn't appear to be a good fit. Furthermore, when I do this exercise in R I get different coefficients and a model that makes more sense.
lapply(c("data.table","ggplot2"), require, character.only=T)
dt <- data.table(Age=c(13, 25, 40, 51, 55, 58), ZepplinFan=c(0, 0, 1, 0, 1, 1))
mylogit <- glm(ZepplinFan ~ Age, data = dt, family = "binomial")
newdata <- data.table(Age=seq(10,70,1))
newdata[, ZepplinFan:=predict(mylogit, newdata=newdata, type="response")]
mylogit$coeff
(Intercept) Age
-4.8434 0.1148
ggplot()+geom_point(data=dt, aes(x=Age, y=ZepplinFan))+geom_line(data=newdata, aes(x=Age, y=ZepplinFan))
What am I missing here?
The problem you are facing is related to the fact that scikit-learn uses regularized logistic regression. The regularization term controls the trade-off between the fit to the data and generalization to future unknown data. The parameter C controls the strength of the regularization (a larger C means weaker regularization), so in your case:
lr = LogisticRegression(C=100)
will generate what you are looking for:
As you have discovered, changing the value of the intercept_scaling parameter also achieves a similar effect. The reason is again regularization, or rather how it affects the estimation of the bias in the regression. A larger intercept_scaling effectively reduces the impact of regularization on the bias.
For more information about the implementation of LR and solvers used by scikit-learn, check: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
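To illustrate the effect without scikit-learn, here is a rough NumPy sketch. It assumes the penalised objective 0.5 * w.w + C * sum(log-loss) with the intercept penalised like any other weight; that is my assumption about what the solver in the question minimised, and fit_l2_logistic is a made-up helper that uses plain Newton steps.

```python
import numpy as np

age = np.array([13, 25, 40, 51, 55, 58], dtype=float)
fan = np.array([0, 0, 1, 0, 1, 1], dtype=float)
A = np.column_stack([np.ones_like(age), age])    # columns: intercept, Age

def fit_l2_logistic(A, y, C, n_iter=50):
    """Newton's method for 0.5 * w.w + C * sum(log-loss); intercept is penalised too."""
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ w))                  # predicted probabilities
        grad = C * (A.T @ (p - y)) + w                    # gradient of the objective
        hess = C * (A.T * (p * (1 - p))) @ A + np.eye(A.shape[1])
        w -= np.linalg.solve(hess, grad)                  # Newton update
    return w

strong = fit_l2_logistic(A, fan, C=1.0)   # regularization matters
weak = fit_l2_logistic(A, fan, C=1e6)     # regularization almost switched off
print("C=1:  ", strong)
print("C=1e6:", weak)
```

With C=1 the fit should land near the coefficients quoted in the question, and with a very large C it should approach the glm result from R, which is the same reason C=100 already looks much better.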