Plotting new points in a subspace after dimensionality reduction - dimensionality-reduction

I would like to plot points that each have 100 parameters, with values between 0 and 99, on a 2-dimensional plot. This should be straightforward with the usual dimensionality-reduction methods (PCA/t-SNE/UMAP etc.), but I need to be able to add subsequent points to the plot without the projection being recalculated and therefore changing.
I am picturing an algorithm that takes a data point with its 100 values and converts it to X,Y coordinates that can then be plotted, such that points that are close in the 2D projection are close in the original 100D space. Does such an algorithm exist? If not, are there any alternative approaches?
Thanks

I am not sure I understood the question correctly, but given an initial set X we can fit a PCA to compute the principal components. These components can then be used to transform new samples.
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
# 50 samples with 100 features, values between 0 and 99
n_samples, n_feats = 50, 100
X = np.random.randint(0, 100, size=(n_samples, n_feats))
# fit the PCA once on the initial data and keep it for transforming later samples
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1])
This plots the initial points.
Then, when a new sample comes in:
# transform the new sample with the already-fitted PCA; the existing points do not move
new_sample = np.random.randint(0, 100, size=(1, n_feats))
new_sample_reduced = pca.transform(new_sample)
plt.scatter(new_sample_reduced[:, 0], new_sample_reduced[:, 1], color="red")
We can plot it in red on the same axes without refitting.
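If a non-linear projection is preferred, the umap-learn package follows the same fit/transform pattern, so new points can be projected into an existing 2D embedding without moving the points already plotted. This is only a minimal sketch, assuming umap-learn is installed and reusing X and new_sample from above; note that UMAP's transform of unseen points is approximate:
import umap
# learn the embedding once on the initial data
reducer = umap.UMAP(n_components=2).fit(X)
X_embedded = reducer.embedding_
# project the new sample without refitting; existing coordinates are unchanged
new_embedded = reducer.transform(new_sample)
plt.scatter(X_embedded[:, 0], X_embedded[:, 1])
plt.scatter(new_embedded[:, 0], new_embedded[:, 1], color="red")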

Related

sklearn GaussianMixture on Images

I want to use Gaussian Mixture models to find the centers of multimodal distributions that look something like this:
To this end I want to use sklearn.mixture.GaussianMixture. This class fits a mixture of Gaussian distributions to data, and it is usually used like this:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from sklearn import mixture
n_samples = 300
# generate random sample, two components
np.random.seed(0)
# generate spherical data centered on (20, 20)
shifted_gaussian = np.random.randn(n_samples, 2) + np.array([20, 20])
# generate zero centered stretched Gaussian data
C = np.array([[0., -0.7], [3.5, .7]])
stretched_gaussian = np.dot(np.random.randn(n_samples, 2), C)
# concatenate the two datasets into the final training set
X_train = np.vstack([shifted_gaussian, stretched_gaussian])
# fit a Gaussian Mixture Model with two components
clf = mixture.GaussianMixture(n_components=2, covariance_type='full')
clf.fit(X_train)
The point is that the data is given as a list of 2D points that form a Gaussian cloud. My data is a little different: more like weighted x,y points. Given my image, I could do something like this:
import numpy as np
import cv2
# read the image as a single-channel intensity map
image = cv2.imread("double_blob.jpg", cv2.IMREAD_GRAYSCALE)
xs, ys = np.meshgrid(range(image.shape[0]), range(image.shape[1]))
xs, ys = xs.flatten(), ys.flatten()
weights = image[xs, ys].flatten()
to get a list of x,y image coordinates and weights. But I don't know how I can feed this to the GaussianMixture function. Any ideas?
I have found a 'cheat' way of doing it:
from sklearn.mixture import GaussianMixture
import cv2 as cv
import numpy as np
# read the blob image as grayscale and stretch its values to the 0-255 range
data = cv.imread("dual_blob.jpg", cv.IMREAD_GRAYSCALE)
data = cv.normalize(data, None, 0, 255, cv.NORM_MINMAX)
gmm = GaussianMixture(n_components=2)
# one (x, y) pair per pixel of the image
xs, ys = np.meshgrid(range(data.shape[0]), range(data.shape[1]))
xs, ys = xs.flatten(), ys.flatten()
# repeat each coordinate as many times as its pixel value; zero-valued pixels get a
# (2, 1) column of -1s so the concatenation below never receives an empty array
gmm_data = [
    np.array([[x, y]] * int(data[x, y])).transpose()
    if int(data[x, y]) > 0
    else -np.ones((2, 1))
    for x, y in zip(xs, ys)
]
gmm_data = np.concatenate(gmm_data, axis=1)
# drop the -1 sentinels and reshape back to an (n_points, 2) array
gmm_data = gmm_data[gmm_data >= 0]
gmm_data = gmm_data.reshape(2, gmm_data.shape[0] // 2).transpose()
print(gmm_data)
gmm.fit(gmm_data)
centers = gmm.means_
Basically what it does is normalise the image to between 0 and 255, then go over every pixel and create as many copies of that pixel's coordinate as its value. So if the pixel at [3, 7] has a value of 10, then [3, 7] gets added to the list of points ten times. This gives:
However, this solution is quite ugly, so I'm definitely keen to see if anyone has something better.
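One somewhat tidier alternative, sketched here under the same assumptions (a grayscale image normalised to the 0-255 range), is to let np.repeat duplicate each pixel coordinate by its intensity instead of building the point list and the -1 sentinels by hand:
import cv2 as cv
import numpy as np
from sklearn.mixture import GaussianMixture
data = cv.imread("dual_blob.jpg", cv.IMREAD_GRAYSCALE)
data = cv.normalize(data, None, 0, 255, cv.NORM_MINMAX)
# one (x, y) row per pixel, repeated as many times as that pixel's intensity
xs, ys = np.meshgrid(range(data.shape[0]), range(data.shape[1]), indexing="ij")
coords = np.stack([xs.ravel(), ys.ravel()], axis=1)
gmm_data = np.repeat(coords, data.ravel().astype(int), axis=0)
gmm = GaussianMixture(n_components=2).fit(gmm_data)
centers = gmm.means_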

How to connect the points from a contour of a binary image

I have a segmentation result stored in a binary image, from which I want to extract the contours. To do so, I compute the difference between the mask and the eroded mask. Hence, I am able to extract the pixels that are on the boundaries of my segmentation result. Here is a code snippet:
import numpy as np
from skimage.morphology import binary_erosion
from matplotlib import pyplot as plt
# mask is a 2D boolean np.array containing the segmentation result
contour_raw=np.logical_xor(mask,binary_erosion(mask))
contour_y,contour_x=np.where(contour_raw)
fig=plt.figure()
plt.imshow(mask)
plt.plot(contour_x,contour_y,'.r')
I end up with a collection of dots on the contours of the mask:
The trouble starts when I want to connect the dots. A naive plot of the contours of course gives a disappointing result, because contour_x and contour_y are not sorted as I would like:
plt.plot(contour_x,contour_y,'--r')
And here is the result, with a focus on an arbitrary part of the figure to highlight the connection between the dots:
How is it possible to sort the contour coordinates contour_x and contour_y so that they are correctly ordered when I connect the dots? Furthermore, if my mask contains several independent connected components, I would like to obtain as many contours as there are connected components.
Thanks for your help!
Best,
I think combining clustering and a convex hull works in your case. For this example, I am generating three synthetic segments using the make_blobs function and showing each in a different color:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from scipy.spatial import ConvexHull, convex_hull_plot_2d
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=0, cluster_std=0.3)
plt.scatter(X[:,0], X[:,1], c=y)
Then, since the segments are distributed in a two-dimensional map, we can run a density-based clustering method to group them, and then, by finding a convex hull around each cluster, obtain an ordered set of points surrounding each cluster:
# Fitting Clustering
c_alg = DBSCAN()
c_alg.fit(X)
labels = c_alg.labels_
for i in range(0, max(labels) + 1):
    ind = np.where(labels == i)
    segment = X[ind, :][0]
    hull = ConvexHull(segment)
    plt.plot(segment[:, 0], segment[:, 1], 'o')
    for simplex in hull.simplices:
        plt.plot(segment[simplex, 0], segment[simplex, 1], 'k-')
However, in your case a concave hull should work, not a convex hull. There is a Python package, alphashape, that claims to find concave hulls in two-dimensional maps (more information here). The tricky part is finding the best alpha, but in this example we can fit concave hulls using:
import alphashape
from descartes import PolygonPatch
fig, ax = plt.subplots()
for i in range(0, max(labels) + 1):
    ind = np.where(labels == i)
    points = X[ind, :][0, :, :]
    alpha_shape = alphashape.alphashape(points, 5.0)
    ax.scatter(*zip(*points))
    ax.add_patch(PolygonPatch(alpha_shape, alpha=0.5))
plt.show()
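As an alternative to clustering plus alpha shapes, skimage.measure.find_contours returns one already-ordered (row, col) coordinate array per closed contour, which directly gives one curve per connected component of the mask. A minimal sketch, reusing the boolean mask from the question:
from skimage import measure
import matplotlib.pyplot as plt
# each element of `contours` is an (N, 2) array of (row, col) points in drawing order
contours = measure.find_contours(mask.astype(float), 0.5)
plt.imshow(mask)
for contour in contours:
    plt.plot(contour[:, 1], contour[:, 0], '--r')
plt.show()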

Why does kmeans give exactly the same results everytime?

I have re-run k-means 4 times and get the same result each time.
From other answers, I understand that
Everytime K-Means initializes the centroid, it is generated randomly.
Could you please explain why the results are exactly the same each time?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')
fig, ax = plt.subplots(nrows=2, ncols=2, figsize= 2 * np.array(plt.rcParams['figure.figsize']))
for row in ax:
    for col in row:
        kmeans = KMeans(n_clusters=4)
        kmeans.fit(don)
        y_kmeans = kmeans.predict(don)
        col.scatter(don['V1'], don['V2'], c=y_kmeans, cmap='viridis')
        centers = kmeans.cluster_centers_
        col.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
plt.show()
They are not the same; they are similar. K-means is an algorithm that iteratively moves centroids so that they split the data better and better. While this process is deterministic, you have to pick initial values for those centroids, and this is usually done at random. A random start does not mean the final centroids will be random: they converge to something relatively good, and often similar.
Have a look at your code with this simple modification:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')
fig, ax = plt.subplots(nrows=2, ncols=2, figsize= 2 * np.array(plt.rcParams['figure.figsize']))
cc = []
for row in ax:
    for col in row:
        kmeans = KMeans(n_clusters=4)
        kmeans.fit(don)
        cc.append(kmeans.cluster_centers_)
        y_kmeans = kmeans.predict(don)
        col.scatter(don['V1'], don['V2'], c=y_kmeans, cmap='viridis')
        centers = kmeans.cluster_centers_
        col.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
plt.show()
cc
If you have a look at the exact values of those centroids, they will look like this:
[array([[ 4.97975722, 4.93316461],
[ 5.21715504, -0.18757547],
[ 0.31141141, 0.06726803],
[ 0.00747797, 5.00534801]]),
array([[ 5.21374245, -0.18608103],
[ 0.00747797, 5.00534801],
[ 0.30592308, 0.06549162],
[ 4.97975722, 4.93316461]]),
array([[ 0.30066361, 0.06804847],
[ 4.97975722, 4.93316461],
[ 5.21017831, -0.18735444],
[ 0.00747797, 5.00534801]]),
array([[ 5.21374245, -0.18608103],
[ 4.97975722, 4.93316461],
[ 0.00747797, 5.00534801],
[ 0.30592308, 0.06549162]])]
Similar, but different sets of values.
Also:
Have a look at default arguments to KMeans. There is one called n_init:
Number of time the k-means algorithm will be run with different
centroid seeds. The final results will be the best output of
n_init consecutive runs in terms of inertia.
By default it is equal to 10, which means that every time you run k-means it actually runs 10 times and picks the best result. Those best results will be even more similar than the results of single runs of k-means.
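To make the effect visible, the averaging over n_init can be switched off, or the randomness can be pinned down entirely with random_state. A small sketch reusing the don data frame from above:
# a single random initialisation per call: run-to-run differences become more visible
kmeans_single = KMeans(n_clusters=4, n_init=1, init='random').fit(don)
# a fixed seed: the fitted centroids are identical on every run
kmeans_seeded = KMeans(n_clusters=4, random_state=0).fit(don)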
I am posting @AEF's comment to remove this question from the unanswered list.
Random initialization does not necessarily mean a random result. Easiest example: k-means with k=1 always finds the mean in one step, regardless of where the center is initialised.
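A quick way to see this, reusing don: with a single cluster the fitted centre is simply the mean of the data, whatever the initialisation was.
km1 = KMeans(n_clusters=1).fit(don)
print(km1.cluster_centers_)  # the single fitted centre
print(don.mean().values)     # matches the column means of the data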
Whenever randomization is part of a Scikit-learn algorithm, a random_state parameter may be provided to control the random number generator used. Note that the mere presence of random_state doesn’t mean that randomization is always used, as it may be dependent on another parameter, e.g. shuffle, being set.
The passed value will have an effect on the reproducibility of the results returned by the function (fit, split, or any other function like k_means). random_state’s value may be:
For reference: https://scikit-learn.org/stable/glossary.html#term-random_state

Local Outlier Factor only calculated for some points (scikitLearn)

I have a large csv file containing 2 columns representing the result of k-means clustering. I calculated 11 centroids, and the csv file records, for each point, which centroid is closest and the point's distance to that centroid.
The entries look like:
K11-closest,K11-distance
0,31544.821603570384
0,31494.23348984612
0,31766.471900874752
0,31710.896696452823
Then I want to calculate and plot the LOF using a script I found on scikit-learn.org
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
dataset = pd.read_csv('0.csv')
clf = LocalOutlierFactor(n_neighbors=20)
# use fit_predict to compute the predicted labels of the training samples
# (when LOF is used for outlier detection, the estimator has no predict,
# decision_function and score_samples methods).
y_pred = clf.fit_predict(dataset)
X_scores = clf.negative_outlier_factor_
plt.title("Local Outlier Factor (LOF)")
plt.scatter(dataset.iloc[:, 0], dataset.iloc[:, 1], color='k', s=3., label='Data points')
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
plt.scatter(dataset.iloc[:, 0].values, dataset.iloc[:, 1].values, s=50 * radius, edgecolors='r',
            facecolors='none', label='Outlier scores')
plt.show()
But the plot shows:
The black points are the data points and the red circles show how much of an outlier each point is.
So I assume the LOF is not calculated for every point. But why? And how do I calculate it for every point and make it visible in the plot?
Normalising the data will help you produce more readable graphs. Also, in your code the radius multiplier is 50, whereas I have used 1000.
As we can see, the algorithm does not mark a red circle around every data point, and the result also depends on how many nearest neighbours (n_neighbors) the algorithm takes into account when drawing the circles.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
dataset = pd.DataFrame(
    data=[[0, 31544.821603570384], [0, 31494.23348984612],
          [0, 31766.471900874752], [0, 31710.896696452823]],
    columns=["K11-closest", "K11-distance"])
dataset = scaler.fit_transform(dataset)
clf = LocalOutlierFactor(n_neighbors=3)
y_pred = clf.fit_predict(dataset)
X_scores = clf.negative_outlier_factor_
plt.title("Local Outlier Factor (LOF)")
plt.scatter(dataset[:, 0], dataset[:, 1], color='k', s=3., label='Data points')
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
plt.scatter(dataset[:, 0], dataset[:, 1], s=1000 * radius, edgecolors='r',
            facecolors='none', label='Outlier scores')
legend = plt.legend(loc='upper left')
legend.legendHandles[0]._sizes = [10]
legend.legendHandles[1]._sizes = [20]
plt.show()
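As a side note, LocalOutlierFactor does compute a score for every training sample: clf.negative_outlier_factor_ has one entry per row, with values near -1 for inliers and increasingly negative values for outliers, so the circles for clear inliers are simply too small to see. A quick check:
print(X_scores.shape)  # one score per sample
print(X_scores)        # close to -1 for inliers, more negative for outliers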

Plot 3D density plot from many 2D arrays

I am trying to plot a 3D density plot from many 2D numpy arrays of the same shape. Each [x,y] coordinate returns an intensity (how dense it is at that point).
I cannot figure out how to plot this using matplotlib.
I'm able to successfully get a contour plot by plotting just one 2D array, or use imshow to get a nice slice of my density at a certain 'z' cut, but again only for that one 2D array.
I have an object, data: when I call its slice() method with an integer from 0 to 480, I get a 2D array of that 'z' cross-section:
plt.imshow(data.slice(200))
I want to be able to plot a density map by iterating over data.slice(n) for n = 0 to 480 and plotting that as a single image.
I'm not sure how to do such a thing.
If you have lots of slices that you want to view as a density map from one side, you can average over all the cells along a given axis and then view that as an image.
import numpy as np
import matplotlib.pyplot as plt

def plot_projections(d):
    # project onto each plane by averaging along the corresponding axis
    d_mean_0 = d.mean(axis=0)
    d_mean_1 = d.mean(axis=1)
    d_mean_2 = d.mean(axis=2)
    plt.subplot(1, 3, 1)
    plt.imshow(d_mean_0, cmap='rainbow')
    plt.subplot(1, 3, 2)
    plt.imshow(d_mean_1, cmap='rainbow')
    plt.subplot(1, 3, 3)
    plt.imshow(d_mean_2, cmap='rainbow')
    plt.show()

# random 10x10x10 array
d = np.random.randint(0, 10, size=(10, 10, 10))
plot_projections(d)
# pack matrix with 10s along one plane
for i in range(len(d)):
    d[2][i] = np.array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
plot_projections(d)
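To apply this to the data object from the question, the individual cross-sections can be stacked into a single 3D array first. A sketch, assuming data.slice(n) returns 2D arrays of the same shape for n = 0 to 480:
# stack all z cross-sections into one volume, then view its axis projections
volume = np.array([data.slice(n) for n in range(481)])
plot_projections(volume)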

Resources