How To Plot n Furthest Points From Each Centroid KMeans - python-3.x

I am trying to train a kmeans model on the iris dataset in Python.
Is there a way to plot n furthest points from each centroid using kmeans in Python?
Here is my code so far:
from sklearn import datasets
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
# import iris dataset
iris = datasets.load_iris()
X = iris.data[:, 2:4]  # use two variables (petal length and petal width)
# plot the two variables to check the number of clusters
plt.scatter(X[:, 0], X[:, 1])
# kmeans
km = KMeans(n_clusters=2, random_state=0)  # chose two clusters
y_pred = km.fit_predict(X)
X_dist = km.transform(X)  # distances to each centroid, shape (n_samples, n_clusters)
## Stuck at this point: how to make a function that extracts the three points
## that are furthest from each of the two centroids?
max3IdxArr = []
for label in np.unique(km.labels_):
    X_label_indices = np.where(y_pred == label)[0]
    max3Idx = X_label_indices[np.argsort(X_dist[:3])]  # This part is wrong
    max3IdxArr.append(max3Idx)
max3IdxArr
# plot
idx = np.concatenate(max3IdxArr)
plt.scatter(X[idx, 0], X[idx, 1])

What you did is np.argsort(X_dist[:3]), which slices the first three values of the unsorted X_dist and sorts only those. Instead, sort first and slice afterwards: take idx = np.argsort(x_dist) and, once the sorting is done, take idx[-3:] (argsort is ascending, so the last three entries are the furthest points; idx[:3] would give the closest ones). Note also that X_dist has one column per centroid, so use the column belonging to each point's own cluster.
Feel free to ask if this isn't working.
Cheers
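Putting that together, a minimal sketch of the full loop (assuming the km, X, y_pred and X_dist variables from the question, and measuring each point's distance to its own centroid):
import numpy as np
import matplotlib.pyplot as plt
furthest_idx = []
for label in np.unique(km.labels_):
    members = np.where(y_pred == label)[0]  # indices of the points in this cluster
    dists = X_dist[members, label]          # distance of each member to its own centroid
    furthest_idx.append(members[np.argsort(dists)[-3:]])  # three largest distances
furthest_idx = np.concatenate(furthest_idx)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)  # all points, coloured by cluster
plt.scatter(X[furthest_idx, 0], X[furthest_idx, 1],
            facecolors='none', edgecolors='r', s=120)  # furthest points circled
plt.show()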

Related

Can Principal Component Analysis be applied on 2D geometry with X and Y nodes in Python?

Aim of the task: I have sets of X and Y coordinates of a geometry, and I want to align the geometry. The coordinates and the respective geometry are shown in the picture.
X1_coordinate = [0.0, 0.87, 1.37, 1.87, 2.73, 3.6, 4.46, 4.96, 5.46, 4.6, 3.73, 2.87, 2.0, 1.5, 1.0, 0.5, 2.37, 3.23, 4.1]
Y1_coordinate = [0.0, 0.5, -0.37, -1.23, -0.73, -0.23, 0.27, -0.6, -1.46, -1.96, -2.46, -2.96, -3.46, -2.6, -1.73, -0.87, -2.1, -1.6, -1.1]
Question: Can I apply Principal Component Analysis to a 2D geometry to align it, so that its principal axes are parallel to the reference axes (X and Y)?
Expected output: I want my geometry like this (this is just an example): the principal axes of the geometry should lie on the reference axes or be parallel to them.
What I tried: I tried the code below to implement PCA and obtain the aligned geometry.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler
plt.style.use('ggplot')
# Load the data
# iris = datasets.load_iris()
X = X1_coordinate
y = Y1_coordinate
# Z-score the features
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
# The PCA model
pca = PCA(n_components=2) # estimate only 2 PCs
X_new = pca.fit_transform(X) # project the original data into the PCA space
However, after running the code, I got the error mentioned below.
Kindly let me know what I should do to align my geometry. Looking forward to your answers.
Basically, you can apply PCA to this task.
import sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

circle_pts = sklearn.datasets.make_circles()    # get two circles with labels
circle_pts = circle_pts[0][circle_pts[1] == 0]  # keep only one circle
ang = 63 / 180 * np.pi  # angle of rotation in radians
R = np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
ellipse_pts = circle_pts
ellipse_pts[:, 0] *= 4.5             # stretch the circle into an ellipse
ellipse_rot_pts = ellipse_pts @ R.T  # rotate the ellipse

plt.figure()
plt.scatter(ellipse_rot_pts[:, 0], ellipse_rot_pts[:, 1])
plt.axis("equal")
plt.tight_layout()
plt.show()

scaler = StandardScaler(with_std=False)  # centre the data without rescaling
scaler.fit(ellipse_rot_pts)
X = scaler.transform(ellipse_rot_pts)
pca = PCA(n_components=2)     # estimate only 2 PCs
X_new = pca.fit_transform(X)  # project the original data into the PCA space

plt.figure()
plt.scatter(X[:, 0], X[:, 1])
singular_values = pca.singular_values_
# draw the principal axes, scaled by the singular values
plt.plot([0, singular_values[0] * pca.components_[0, 0]], [0, singular_values[0] * pca.components_[0, 1]])
plt.plot([0, singular_values[1] * pca.components_[1, 0]], [0, singular_values[1] * pca.components_[1, 1]])
plt.axis("equal")
plt.show()

plt.figure()
plt.title("Aligned with axis figure")
plt.scatter(X_new[:, 0], X_new[:, 1])
plt.axis("equal")
plt.show()
But the problem is that not every geometry is appropriate for this. An ellipse has two main axes of symmetry; your figure, for example, doesn't. So the principal components, which are found by maximising the variance in the data, don't correspond to the axis alignment in your example (expected output).
For example, your set of points corresponds to this variant of component alignment:
Your geometry
And for a slightly modified, more symmetric object:
A little more symmetrical figure
Hope this helps.
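As for the error in the question: StandardScaler and PCA expect a 2D array of shape (n_samples, n_features), so the two coordinate lists have to be stacked into a single (n, 2) array first. A minimal sketch using the coordinates from the question (centering without rescaling, as in the answer above):
import numpy as np
from sklearn.decomposition import PCA
pts = np.column_stack([X1_coordinate, Y1_coordinate])  # shape (19, 2)
pts_centered = pts - pts.mean(axis=0)                  # centre, keep the original scale
pca = PCA(n_components=2)
pts_aligned = pca.fit_transform(pts_centered)          # geometry rotated onto its principal axes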

Local Outlier Factor only calculated for some points (scikit-learn)

I have a large csv file containing 2 columns that represent the result of k-means clustering. I calculated 11 centroids, and the csv file records, for each point, which centroid is closest and the point's distance to it.
The entries look like:
K11-closest,K11-distance
0,31544.821603570384
0,31494.23348984612
0,31766.471900874752
0,31710.896696452823
Then I want to calculate and plot the LOF using a script I found on scikit-learn.org
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
dataset = pd.read_csv('0.csv')
clf = LocalOutlierFactor(n_neighbors=20)
# use fit_predict to compute the predicted labels of the training samples
# (when LOF is used for outlier detection, the estimator has no predict,
# decision_function and score_samples methods).
y_pred = clf.fit_predict(dataset)
X_scores = clf.negative_outlier_factor_
plt.title("Local Outlier Factor (LOF)")
plt.scatter(dataset.iloc[:, 0], dataset.iloc[:, 1], color='k', s=3., label='Data points')
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
plt.scatter(dataset.iloc[:, 0].values, dataset.iloc[:, 1].values, s=50 * radius, edgecolors='r',
facecolors='none', label='Outlier scores')
plt.show()
But the plot shows:
With the black points being the data points, and red circles showing how much of an outlier each point is.
So I assume the LOF is not calculated for every point. But why? How can I calculate it for every point and make it visible in the plot?
Normalising the data will help you produce more readable graphs. Also, your code uses a radius multiplier of 50, whereas I have used 1000.
As we can see, the algorithm does not draw a visible red circle for every data point, and the result also depends on how many nearest neighbours (n_neighbors) it takes into account when computing the scores.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
dataset = pd.DataFrame(data=[[0, 31544.821603570384], [0, 31494.23348984612],
                             [0, 31766.471900874752], [0, 31710.896696452823]],
                       columns=["K11-closest", "K11-distance"])
dataset = scaler.fit_transform(dataset)
clf = LocalOutlierFactor(n_neighbors=3)
y_pred = clf.fit_predict(dataset)
X_scores = clf.negative_outlier_factor_

plt.title("Local Outlier Factor (LOF)")
plt.scatter(dataset[:, 0], dataset[:, 1], color='k', s=3., label='Data points')
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
plt.scatter(dataset[:, 0], dataset[:, 1], s=1000 * radius, edgecolors='r',
            facecolors='none', label='Outlier scores')
legend = plt.legend(loc='upper left')
legend.legendHandles[0]._sizes = [10]
legend.legendHandles[1]._sizes = [20]
plt.show()
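Note that LOF is in fact computed for every point; the radius formula above maps the least outlying point (the one with the largest negative_outlier_factor_) to radius 0, so its circle is invisible by construction. If you want a visible circle for every point, a small tweak (a sketch, not part of the original answer) is to give the radius a floor:
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
radius = 0.05 + 0.95 * radius  # floor of 0.05 so even the least outlying point is drawn
plt.scatter(dataset[:, 0], dataset[:, 1], s=1000 * radius, edgecolors='r',
            facecolors='none', label='Outlier scores')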

How to plot clusters produced by KMeans using matplotlib?

I used KMeans for clustering as shown below, but I don't know how to plot my clusters in a scatter plot.
Or in a plot like this one.
My code is:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
documents = ["This little kitty came to play when I was eating at a restaurant.",
"Merley has the best squooshy kitten belly.",
"Google Translate app is incredible.",
"If you open 100 tab in google you get a smileyface.",
"Best cat photo I've ever taken.",
"Climbing ninja cat.",
"Impressed with google map feedback.",
"Key promoter extension for Google Chrome."]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
If I understand your question correctly, I think you might be looking to do something like this? I plotted the data, coloring by label, after converting to cluster distance space.
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smileyface.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

df = pd.DataFrame(documents)  # read in your data with pd.read_csv, or if in list form like above do this
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df[0].values.astype('U'))  # make sure you have unicode strings; [0] is the column of the sentences
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=2000, n_init=20)
Xt = model.fit_transform(X)  # cluster distance space, shape (n_samples, true_k)

# label each cluster with its highest-scoring tf-idf terms
X = X.toarray()
fns = np.array(vectorizer.get_feature_names_out())  # feature names, ordered by index (get_feature_names in older scikit-learn)
labels = model.labels_
d = []
for n in sorted(np.unique(labels)):
    t = X[labels == n].sum(axis=0)  # cumulative tf-idf score per term in the cluster
    words = fns[t == t.max()]
    d.append(",".join(words))

t = Xt.T  # transpose the cluster distance space so it can be plotted with mpl

### plot the clusters
fig, ax = plt.subplots(1, 1)
cluster_color_dict = {0: 'purple', 1: 'blue'}  # change these to desired colors
for i in range(len(t[0])):
    ax.scatter(t[0][i], t[1][i], c=cluster_color_dict[labels[i]], edgecolors='grey', lw=0.5, s=200)

p1 = []  # legend patches
for i in range(2):
    h = ax.scatter([], [], c=cluster_color_dict[i],
                   edgecolors='grey', lw=0.5, s=80, label=d[i])
    p1.append(h)
l1 = ax.legend(handles=p1, title='cluster', bbox_to_anchor=(1, 1), loc='upper left')
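A note on the design choice here: KMeans.fit_transform returns the cluster distance space, an array of shape (n_samples, n_clusters) holding each sample's distance to every centroid. With true_k = 2 that gives exactly two columns, which is why they can serve directly as x and y coordinates in the scatter plot.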

How to find the point at which a regression line will intersect the OY axis?

I have a file in which I provide some data, the x and y values. My program draws the regression line for those points, but what I need now is to find the value on the OY axis that my line will intersect if it is extended.
What my program does now:
I need to simply make the line longer, intersect it with the OY axis, and find the exact coordinates of that point.
My code so far:
import numpy as np
import matplotlib.pyplot as plt # To visualize
import pandas as pd # To read data
from sklearn.linear_model import LinearRegression
data = pd.read_csv('data.csv') # load data set
X = data.iloc[:, 0].values.reshape(-1, 1) # .values converts the column into a numpy array
Y = data.iloc[:, 1].values.reshape(-1, 1) # -1 means: infer the number of rows, keep 1 column
linear_regressor = LinearRegression() # create object for the class
linear_regressor.fit(X, Y) # perform linear regression
Y_pred = linear_regressor.predict(X) # make predictions
plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()
My code requires a file called "data.csv" which contains the coordinates of the given values. My example has the values:
5,0.8
10,0.7
15,0.66
20,0.493
25,0.5
30,0.21
Did you want something like this, where you can use the intercept_ attribute of your LinearRegression object to get the y-intercept at x equal to zero:
import numpy as np
import matplotlib.pyplot as plt # To visualize
import pandas as pd # To read data
from io import StringIO
from sklearn.linear_model import LinearRegression
txtfile = StringIO("""5,0.8
10,0.7
15,0.66
20,0.493
25,0.5
30,0.21""")
data = pd.read_csv(txtfile, header=None) # load data set
X = data.iloc[:, 0].values.reshape(-1, 1) # .values converts the column into a numpy array
Y = data.iloc[:, 1].values.reshape(-1, 1) # -1 means: infer the number of rows, keep 1 column
linear_regressor = LinearRegression() # create object for the class
linear_regressor.fit(X, Y) # perform linear regression
Y_pred = linear_regressor.predict(X) # make predictions
plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.plot([0, X[0]], [linear_regressor.intercept_, Y_pred[0]], c="green", linestyle='--')
ax = plt.gcf().gca()
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Output:
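To read off the exact coordinates rather than taking them from the plot, the fitted model already holds them (a small addition, assuming the linear_regressor object above; intercept_ is an array because Y was reshaped to 2D):
# the line crosses the OY axis at x = 0, y = intercept_
print("Intersection with OY:", (0.0, linear_regressor.intercept_[0]))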

Wrong intercept in Spark linear regression

I am starting out with Spark linear regression. I am trying to fit a line to a linear dataset. It seems that the intercept is not adjusting correctly, or perhaps I am missing something.
With intercept=False:
linear_model = LinearRegressionWithSGD.train(labeledData, iterations=100, step=0.0001, intercept=False)
This seems normal. But when I use intercept=True:
linear_model = LinearRegressionWithSGD.train(labeledData, iterations=100, step=0.0001, intercept=True)
The model that I get in the last case is exactly:
(weights=[0.0353471289751], intercept=1.0005127185289888)
I have tried different datasets, step sizes and iteration counts, but the model always converges with an intercept of about 1.
EDIT - This is the code I am using:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD
import numpy as np
import matplotlib.pyplot as plt
from pyspark import SparkContext

sc = SparkContext("local", "regression")

# Generate data
SIZE = 300
SLOPE = 0.1
BASE = -30
NOISE = 10
x = np.arange(SIZE)
delta = np.random.uniform(-NOISE, NOISE, size=(SIZE,))
y = BASE + SLOPE * x + delta
data = list(zip(range(len(y)), y))  # zip with index
dataRDD = sc.parallelize(data)

# Normalize data
# mean = np.mean(data)
# std = np.std(data)
# dataRDD = dataRDD.map(lambda r: (r[0], (float(r[1]) - mean) / std))

labeledData = dataRDD.map(lambda r: LabeledPoint(float(r[1]), [float(r[0])]))

# Create linear model
linear_model = LinearRegressionWithSGD.train(labeledData, iterations=1000, step=0.0002,
                                             intercept=True, convergenceTol=0.000001)
print(linear_model)
true_vs_predicted = labeledData.map(lambda p: (p.label, linear_model.predict(p.features))).collect()

# PLOT
fig = plt.figure()
ax = fig.add_subplot(111)
ax.grid()
y_real = [pair[0] for pair in true_vs_predicted]
y_pred = [pair[1] for pair in true_vs_predicted]
plt.plot(range(len(y_real)), y_real, 'o', markersize=5, c='b')
plt.plot(range(len(y_pred)), y_pred, 'o', markersize=5, c='r')
plt.show()
This is because the number of iterations and the step size are both too small, so the training process ends before reaching the optimum. With the feature x ranging from 0 to 300, a step small enough to keep the slope update stable leaves the intercept moving far too slowly to travel from its starting value to -30.
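A sketch of one way around this (not from the original answer): standardise the feature so SGD can take much larger steps, then map the coefficients back to the original scale. Assuming the labeledData RDD and the x, SLOPE, BASE values from the question:
x_mean = float(np.mean(x))
x_std = float(np.std(x))
scaledData = labeledData.map(
    lambda p: LabeledPoint(p.label, [(p.features[0] - x_mean) / x_std]))
model = LinearRegressionWithSGD.train(scaledData, iterations=1000, step=0.1, intercept=True)
# undo the scaling: y = w*(x - m)/s + b = (w/s)*x + (b - w*m/s)
slope = model.weights[0] / x_std                                 # should be close to SLOPE = 0.1
intercept = model.intercept - model.weights[0] * x_mean / x_std  # should be close to BASE = -30
print(slope, intercept)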
