Get cluster points after KMeans in a list format - python-3.x

Suppose I clustered a data set using sklearn's K-means.
I can see the centroids easily using KMeans.cluster_centers_, but I need to get the points belonging to each cluster in the same way that I get the centroids.
How can I do that?

You are probably looking for the attribute labels_.

You need to do the following (see comments in my code):
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
np.random.seed(0)
# Use Iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# KMeans with 3 clusters
clf = KMeans(n_clusters=3)
clf.fit(X, y)  # y is accepted but ignored; KMeans is unsupervised
#Coordinates of cluster centers with shape [n_clusters, n_features]
clf.cluster_centers_
#Labels of each point
clf.labels_
# !! Get the indices of the points for each corresponding cluster
mydict = {i: np.where(clf.labels_ == i)[0] for i in range(clf.n_clusters)}
# Transform the dictionary into a list of [cluster_label, point_indices] pairs
dictlist = []
for key, value in mydict.items():  # use items() in Python 3 (iteritems() is Python 2 only)
    temp = [key, value]
    dictlist.append(temp)
RESULTS
{0: array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149]),
1: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
2: array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])}
[[0, array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149])],
[1, array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])],
[2, array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])]]
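If you want the actual data points in each cluster rather than just their indices, a minimal extension of the snippet above (reusing X and the fitted clf; the names clusters and cluster_list are only illustrative) could look like this:
# Group the rows of X by their assigned cluster label
clusters = {i: X[clf.labels_ == i] for i in range(clf.n_clusters)}
# Or as a plain list of arrays, one entry per cluster
cluster_list = [X[clf.labels_ == i] for i in range(clf.n_clusters)]
print(cluster_list[0].shape)  # (number of points in cluster 0, n_features)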

This question was asked long ago, so you probably already have an answer, but let me post this in case someone else benefits from it. We can get the cluster centers directly from the fitted model: scikit-learn exposes an attribute called cluster_centers_ with shape (n_clusters, n_features). The very simple code below shows how to work with the cluster centers; please go through all the comments in the code.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
# Iris data
iris = datasets.load_iris()
X = iris.data
# Standardization
std_data = StandardScaler().fit_transform(X)
# KMeans clustering with 3 clusters
clf = KMeans(n_clusters = 3)
clf.fit(std_data)
# Coordinates of cluster centers with shape [n_clusters, n_features]
# As we have 3 clusters with 4 features
print("Shape of cluster:", clf.cluster_centers_.shape)
# Scatter plot to see each cluster points visually
plt.scatter(std_data[:,0], std_data[:,1], c = clf.labels_, cmap = "rainbow")
plt.title("K-means clustering of the iris flower data")
plt.show()
# Putting the ndarray of cluster centers into a pandas DataFrame
coef_df = pd.DataFrame(clf.cluster_centers_, columns = ["Sepal length", "Sepal width", "Petal length", "Petal width"])
print("\nDataFrame containing the cluster centers with feature names:\n", coef_df)
# Converting the ndarray to a nested list
ndarray2list = clf.cluster_centers_.tolist()
print("\nList of cluster centers:\n")
print(ndarray2list)
OUTPUTS:
This is the output of the above code.

Related

Saving an array of Uint8Arrays in a file using Node.js

I'm trying to save an array of Uint8Arrays in a file using Node.js. Does anyone know how I can do that using fs?
Something like this:
[
Uint8Array(162) [
133, 111, 74, 131, 187, 35, 69, 135, 1, 151, 1, 1,
198, 69, 229, 121, 168, 106, 59, 104, 91, 198, 115, 218,
79, 67, 102, 212, 204, 206, 145, 126, 152, 59, 147, 114,
133, 173, 119, 91, 220, 251, 174, 57, 16, 125, 67, 180,
114, 128, 5, 72, 143, 131, 122, 35, 124, 139, 62, 93,
187, 2, 2, 251, 255, 149, 134, 6, 9, 65, 100, 100,
32, 99, 97, 114, 100, 50, 0, 9, 1, 2, 2, 4,
19, 4, 21, 14, 52, 3, 66, 4, 86, 5, 87, 26,
112, 2, 3, 0,
... 62 more items
],
Uint8Array(170) [
133, 111, 74, 131, 67, 31, 18, 227, 1, 159, 1, 1,
187, 35, 69, 135, 0, 7, 102, 52, 237, 242, 169, 192,
199, 237, 238, 142, 226, 200, 98, 129, 144, 184, 181, 198,
33, 248, 228, 223, 158, 171, 250, 192, 16, 125, 67, 180,
114, 128, 5, 72, 143, 131, 122, 35, 124, 139, 62, 93,
187, 3, 5, 251, 255, 149, 134, 6, 8, 65, 100, 100,
32, 99, 97, 114, 100, 0, 10, 1, 2, 2, 4, 17,
4, 19, 4, 21, 14, 52, 3, 66, 4, 86, 5, 87,
29, 112, 2, 3,
... 70 more items
],
Uint8Array(120) [
133, 111, 74, 131, 71, 17, 199, 138, 1, 110, 1, 67,
31, 18, 227, 131, 113, 252, 222, 71, 172, 34, 205, 40,
96, 11, 236, 242, 153, 43, 182, 4, 136, 135, 36, 6,
79, 52, 223, 123, 188, 42, 7, 16, 125, 67, 180, 114,
128, 5, 72, 143, 131, 122, 35, 124, 139, 62, 93, 187,
4, 8, 251, 255, 149, 134, 6, 11, 68, 101, 108, 101,
116, 101, 32, 99, 97, 114, 100, 0, 10, 1, 2, 2,
2, 17, 2, 19, 2, 52, 1, 66, 2, 86, 2, 112,
2, 113, 2, 115,
... 20 more items
],
Uint8Array(126) [
133, 111, 74, 131, 51, 54, 68, 129, 1, 116, 1, 71,
17, 199, 138, 51, 167, 78, 185, 254, 251, 73, 245, 53,
153, 253, 146, 203, 139, 250, 206, 105, 223, 51, 227, 156,
55, 3, 146, 177, 243, 235, 140, 16, 125, 67, 180, 114,
128, 5, 72, 143, 131, 122, 35, 124, 139, 62, 93, 187,
5, 9, 251, 255, 149, 134, 6, 17, 77, 97, 114, 107,
32, 99, 97, 114, 100, 32, 97, 115, 32, 100, 111, 110,
101, 0, 9, 1, 2, 2, 2, 21, 6, 52, 1, 66,
2, 86, 2, 112,
... 26 more items
]
]
I am simply using the following for exporting into a docx file:
buffer = Uint8Array(86762) [
80, 75, 3, 4, 10, 0, 0, 0, 8, 0, 79, 72,
111, 85, 212, 85, 145, 41, 234, 1, 0, 0, 43, 11,
0, 0, 19, 0, 0, 0, 91, 67, 111, 110, 116, 101,
110, 116, 95, 84, 121, 112, 101, 115, 93, 46, 120, 109,
108, 197, 150, 205, 78, 227, 48, 20, 133, 247, 60, 133,
229, 45, 106, 92, 64, 26, 141, 80, 83, 22, 252, 44,
25, 164, 233, 60, 128, 27, 223, 164, 134, 248, 71, 182,
91, 232, 219, 207, 117, 66, 67, 169, 18, 198, 154, 54,
98, 83, 41, 182,
... 86662 more items
]
fs.writeFileSync('output.docx', buffer)
I suppose restoring it would just be fs.readFileSync(pathToFile).

Print every number that is greater than 90 in Python

I have a list of integers, called numbers, in this code.
Here's my code:
numbers = [76, 83, 16, 69, 52, 78, 10, 77, 45, 52, 32, 17, 58, 54, 79, 72, 55, 50, 81, 74, 45, 33, 38, 10, 40, 44, 70, 81, 79, 28, 83, 41, 14, 16, 27, 38, 20, 84, 24, 50, 59, 71, 1, 13, 56, 91, 29, 54, 65, 23, 60, 57, 13, 39, 58, 94, 94, 42, 46, 58, 59, 29, 69, 60, 83, 9, 83, 5, 64, 70, 55, 89, 67, 89, 70, 8, 90, 17, 48, 17, 94, 18, 98, 72, 96, 26, 13, 7, 58, 67, 38, 48, 43, 98, 65, 8, 74, 44, 92]
while numbers>=90:
    print(numbers)
Here is the output:
Traceback (most recent call last):
  File "main.py", line 3, in <module>
    while numbers>=90:
TypeError: '>=' not supported between instances of 'list' and 'int'
numbers = [76, 83, 16, 69, 52, 78, 10, 77, 45, 52, 32, 17, 58, 54, 79, 72, 55, 50, 81, 74, 45, 33, 38, 10, 40, 44, 70, 81, 79, 28, 83, 41, 14, 16, 27, 38, 20, 84, 24, 50, 59, 71, 1, 13, 56, 91, 29, 54, 65, 23, 60, 57, 13, 39, 58, 94, 94, 42, 46, 58, 59, 29, 69, 60, 83, 9, 83, 5, 64, 70, 55, 89, 67, 89, 70, 8, 90, 17, 48, 17, 94, 18, 98, 72, 96, 26, 13, 7, 58, 67, 38, 48, 43, 98, 65, 8, 74, 44, 92]
for number in numbers:
    if number >= 90:
        print(number)
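As a side note, the same filter can be written as a list comprehension; a minimal sketch reusing the numbers list above (the name greater is only illustrative):
# Collect the matching numbers instead of printing them one by one
greater = [number for number in numbers if number >= 90]
print(greater)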

Converting a .mat file to cv image

I have a .mat file and want to convert it into a CV image format such that I can use it for a CNN model.
I am trying to obtain an RGB (or otherwise colored) image, not a grayscale one.
I tried the following (below), but I get a grayscale image, whereas when I plot the actual .mat file using matplotlib it is not grayscale. Also, the .mat file has a px_spacing array apart from the image array; I am not sure whether this is helpful.
from scipy.io import loadmat
import numpy as np
import cv2
from PIL import Image

def mat_to_image(mat_image):
    f = loadmat(mat_image, appendmat=True)
    image = np.array(f.get('I')).astype(np.float32)
    mean = image.mean()
    std = image.std()
    print(mean, std)
    hi = np.max(image)
    lo = np.min(image)
    image = (((image - lo)/(hi - lo))*255).astype(np.uint8)
    im = Image.fromarray(image, mode='RGB')
    return im

images = mat_to_image(dir/filename)
cv_img = cv2.cvtColor(np.array(images), cv2.COLOR_GRAY2RGB)
Normally, plotting the .mat file gives a non-grayscale (colored) image:
imgplot= plt.imshow(loadmat(img,appendmat=True).get('I'))
plt.show()
Here is how the mat file looks after print(loadmat('filename'))
{'__header__': b'MATLAB 5.0 MAT-file, Platform: PCWIN64, Created on: Mon Sep 9 11:32:54 2019',
'__version__': '1.0',
'__globals__': [],
'I': array([
[ 81, 75, 74, 75, -11, 14, 49, 37, 29, -24, -183, -349, -581, -740],
[ 51, 33, 67, 36, 1, 42, 30, 49, 47, 42, 14, -85, -465, -727],
[ 23, 31, 36, 20, 54, 70, 44, 56, 56, 79, 62, 19, -204, -595],
[ 7, 12, 36, 47, 59, 68, 74, 56, 59, 100, 74, 34, -3, -353],
[ 23, 19, 51, 87, 86, 79, 91, 76, 96, 95, 52, 51, 74, -141],
[ 18, 51, 54, 97, 93, 94, 98, 83, 119, 71, 36, 69, 50, -16],
[ -10, 5, 53, 92, 69, 87, 103, 114, 118, 77, 51, 68, 30, 0],
[ -24, 11, 74, 80, 49, 68, 106, 129, 107, 63, 57, 70, 39, -1],
[ -45, 43, 83, 69, 43, 64, 98, 108, 90, 35, 27, 55, 31, -13],
[ -9, 32, 83, 78, 66, 106, 89, 85, 58, 43, 31, 39, 28, 7],
[ 45, 35, 76, 45, 51, 84, 55, 66, 49, 41, 39, 28, 13, -7],
[ 85, 67, 61, 45, 69, 53, 23, 32, 31, -12, -34, -182, -376, -425],
[ 136, 93, 71, 54, 30, 39, 17, -21, -29, -43, -101, -514, -792, -816]
], dtype=int16),
'px_spacing': array([[0.78125]])}
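Note that matplotlib's imshow applies a default colormap when it is given a 2D array, which is why the plot above looks colored even though the data is a single channel. A minimal sketch of applying a colormap explicitly to obtain an RGB image (assuming the array lives under the key 'I' as in the dump above; mat_to_rgb and the choice of viridis are only illustrative):
import numpy as np
import matplotlib.cm as cm
from PIL import Image
from scipy.io import loadmat

def mat_to_rgb(mat_path, cmap=cm.viridis):
    # Load the 2D intensity array stored under the key 'I'
    data = np.array(loadmat(mat_path, appendmat=True).get('I')).astype(np.float32)
    # Normalize to [0, 1] before applying the colormap
    lo, hi = data.min(), data.max()
    norm = (data - lo) / (hi - lo)
    # The colormap returns an RGBA float array; keep RGB and scale to uint8
    rgb = (cmap(norm)[..., :3] * 255).astype(np.uint8)
    return Image.fromarray(rgb, mode='RGB')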

Showing a list empty despite performing operations on it

Actually I need to plot all the variations that occurred only in the month of October 2012, so I am collecting the indices of those rows so that I can use them in xlim for plotting.
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
poll_df=pd.read_csv('http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv')
row_in=0
xlimit=[]
poll_df=poll_df[poll_df['Start Date'].str[:7] == '2012-10']
for date in poll_df['Start Date']:
    if date[0:7] == '2012-10':
        xlimit.append(row_in)
        row_in += 1
    else:
        row_in += 1
print(min(xlimit))
print(max(xlimit))
But I don't understand why xlimit comes out empty despite the operations performed on it.
With a download of that URL, I can load it with np.genfromtxt:
In [232]: data = np.genfromtxt('../Downloads/2012-general-election-romney-vs-oba
...: ma.csv',dtype=None,delimiter=',',names=True,invalid_raise=False,encodi
...: ng=None)
/usr/local/bin/ipython3:1: ConversionWarning: Some errors were detected !
Line #77 (got 13 columns instead of 17)
Line #238 (got 13 columns instead of 17)
Line #460 (got 18 columns instead of 17)
Line #488 (got 18 columns instead of 17)
Line #493 (got 13 columns instead of 17)
Line #507 (got 18 columns instead of 17)
Line #515 (got 18 columns instead of 17)
Line #538 (got 18 columns instead of 17)
Line #550 (got 18 columns instead of 17)
#!/usr/bin/python3
It's not quite as forgiving as pandas when dealing with shorter/longer length lines.
In [233]: data.shape
Out[233]: (577,)
In [234]: data.dtype
Out[234]: dtype([('Pollster', '<U56'), ('Start_Date', '<U10'), ('End_Date', '<U10'), ('Entry_DateTime_ET', '<U20'), ('Number_of_Observations', '<i8'), ('Population', '<U26'), ('Mode', '<U15'), ('Obama', '<f8'), ('Romney', '<f8'), ('Undecided', '<f8'), ('Other', '<f8'), ('Pollster_URL', '<U113'), ('Source_URL', '<U189'), ('Partisan', '<U11'), ('Affiliation', '<U5'), ('Question_Text', '?'), ('Question_Iteration', '<i8')])
The start_date field looks like:
In [235]: data['Start_Date'][:10]
Out[235]:
array(['2012-11-04', '2012-11-03', '2012-11-03', '2012-11-03',
       '2012-11-03', '2012-11-03', '2012-11-03', '2012-11-01',
       '2012-11-02', '2012-11-02'], dtype='<U10')
I can search it with where. I'm using astype to restrict the field to 7 characters.
In [236]: np.where(data['Start_Date'].astype('U7')=='2012-10')[0]
Out[236]:
array([18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,
70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86,
87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
I can use usecols to get around the variable line lengths - assuming the 'bad' lines just differ in the latter fields.
In [237]: data = np.genfromtxt('../Downloads/2012-general-election-romney-vs-oba
...: ma.csv',dtype=None,delimiter=',',names=True,invalid_raise=False,encodi
...: ng=None,usecols=range(10))
In [238]: data.shape
Out[238]: (586,)
In [239]: np.where(data['Start_Date'].astype('U7')=='2012-10')[0]
Out[239]:
array([ 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57,
58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96,
97, 98, 99, 100])
I can get the same list with an iterative search like yours:
In [244]: alist = []
In [245]: for i,date in enumerate(data['Start_Date']):
...: if date[:7] == '2012-10':
...: alist.append(i)
...:
In [246]: len(alist)
Out[246]: 82
In [247]: np.array(alist)
Out[247]:
array([ 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57,
58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96,
97, 98, 99, 100])
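For comparison, a minimal pandas sketch of the same lookup (using the URL and the 'Start Date' column from the question; the names is_october and xlimit are only illustrative) would be:
import pandas as pd

url = 'http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv'
poll_df = pd.read_csv(url)
# Positional indices of rows whose Start Date falls in October 2012
is_october = poll_df['Start Date'].str[:7] == '2012-10'
xlimit = list(poll_df.index[is_october])
print(min(xlimit), max(xlimit))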

Identifying edges in a connected component - Python

I have a set of connected components obtained from a tree T as follows.
To get all the connected components, I'm using the following code sample:
connectedcomponents = sorted(nx.connected_components(T), key = len,reverse=True)
This is what I'm getting:
[{66, 98, 68, 37, 7, 8, 73, 75, 44, 47, 81, 51, 19, 23, 55, 56, 58, 62, 63}, {97, 3, 6, 71, 39, 11, 77, 17, 60, 95}, {99, 5, 10, 43, 15, 20, 90, 92, 93}, {96, 4, 76, 80, 84, 53, 52}, {34, 74, 46, 18, 24, 61, 30}, {36, 9, 41, 83, 88, 57}, {65, 69, 40, 78, 21, 54}, {1, 2, 12, 13}, {89, 26, 70, 31}, {0, 42, 28, 79}, {32, 85, 86}, {59, 45, 94}, {82, 50, 22}, {64, 72}, {33, 14}, {16}, {87}, {48}, {91}, {49}, {67}, {29}, {35}, {25}, {38}, {27}]
I need to get the edges in each of these components. For example, for the first component {66, 98, 68, 37, 7, 8, 73, 75, 44, 47, 81, 51, 19, 23, 55, 56, 58, 62, 63} I need a separate list of edges such as [(37,47),(47,7),(7,62),...].
I tried it as follows:
def nodes_connected(u, v):
    if u in T.neighbors(v):
        return True
    else:
        return False

for i in connectedcomponents:
    print(i)
    for u, v in i:
        nodes_connected("u", "v")
        edges.append((u, v))
But it didn't work!
Can someone please help me with this?
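One standard way to get the edges of each component with networkx is to take the subgraph induced on the component's nodes. A minimal self-contained sketch (the tiny example graph here just stands in for the tree T from the question):
import networkx as nx

# Stand-in for the tree T from the question, with two components
T = nx.Graph([(37, 47), (47, 7), (7, 62), (3, 6), (6, 71)])

connectedcomponents = sorted(nx.connected_components(T), key=len, reverse=True)

# The subgraph induced on a component's nodes contains exactly that component's edges
edges_per_component = [list(T.subgraph(nodes).edges()) for nodes in connectedcomponents]

for comp_edges in edges_per_component:
    print(comp_edges)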
