How to get k means cluster for 1D data?

I have a csv file which looks like below
date mse
2018-02-11 14.34
2018-02-12 7.24
2018-02-13 4.5
2018-02-14 3.5
2018-02-16 12.67
2018-02-21 45.66
2018-02-22 15.33
2018-02-24 98.44
2018-02-26 23.55
2018-02-27 45.12
2018-02-28 78.44
2018-03-01 34.11
2018-03-05 23.33
2018-03-06 7.45
... ...
Now I want to split the mse values into two clusters so that I know which values belong to which cluster, along with each cluster's mean.
Since I do not have any other set of values apart from mse (and I thought I had to provide both X and Y), I would like to use just the mse values for k-means clustering. For now, I pass a range of the same length as the mse values as the second set of values. This is what I did:
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
df = pd.read_csv("generate_csv/all_data_device.csv", parse_dates=["date"])
f1 = df['mse'].values
# generate another list
f2 = list(range(0, len(f1)))
X = np.array(list(zip(f1, f2)))
kmeans = KMeans(n_clusters=2).fit(X)
labels = kmeans.predict(X)
# Centroid values
centroids = kmeans.cluster_centers_
#print(centroids)
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X[:, 0], X[:, 1], c=labels)
ax.scatter(centroids[:, 0], centroids[:, 1], marker='*', c='#050505', s=1000)
plt.title('K Mean Classification')
plt.show()
How can I use just the mse values to get the k-means clusters? I am aware of the function 'reshape()' but am not quite sure how to use it.

Demo:
In [29]: kmeans = KMeans(n_clusters=2)
In [30]: df['label'] = kmeans.fit_predict(df[['mse']])
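# NOTE: the double brackets in df[['mse']] select a one-column DataFrame (2-D), which is the input shape KMeans expects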
In [31]: df
Out[31]:
         date    mse  label
0  2018-02-11  14.34      0
1  2018-02-12   7.24      0
2  2018-02-13   4.50      0
3  2018-02-14   3.50      0
4  2018-02-16  12.67      0
5  2018-02-21  45.66      0
6  2018-02-22  15.33      0
7  2018-02-24  98.44      1
8  2018-02-26  23.55      0
9  2018-02-27  45.12      0
10 2018-02-28  78.44      1
11 2018-03-01  34.11      0
12 2018-03-05  23.33      0
13 2018-03-06   7.45      0
plotting:
In [64]: ax = df[df['label']==0].plot.scatter(x='mse', y='label', s=50, color='white', edgecolor='black')
In [65]: df[df['label']==1].plot.scatter(x='mse', y='label', s=50, color='white', ax=ax, edgecolor='red')
Out[65]: <matplotlib.axes._subplots.AxesSubplot at 0xfa42be0>
In [66]: plt.scatter(kmeans.cluster_centers_.ravel(), [0.5]*len(kmeans.cluster_centers_), s=100, color='green', marker='*')
Out[66]: <matplotlib.collections.PathCollection at 0xfabf208>
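Since the question asks about reshape(), here is a minimal sketch of the same idea on a plain NumPy array, using the mse values listed in the question. The choices n_init=10 and random_state=0 are my own assumptions to make the run repeatable; they are not part of the answer above.

```python
import numpy as np
from sklearn.cluster import KMeans

# mse values copied from the sample rows in the question
mse = np.array([14.34, 7.24, 4.5, 3.5, 12.67, 45.66, 15.33,
                98.44, 23.55, 45.12, 78.44, 34.11, 23.33, 7.45])

# KMeans expects a 2-D array of shape (n_samples, n_features).
# reshape(-1, 1) turns the 1-D vector into a single-feature column.
X = mse.reshape(-1, 1)

# n_init and random_state are assumptions chosen for repeatability
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# One centre per cluster; ravel() flattens the (2, 1) array to (2,)
centers = kmeans.cluster_centers_.ravel()
```

Here labels[i] is the cluster of mse[i], and centers[k] is the mean of the values assigned to cluster k.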

Related

Plot Histogram on different axes

I am reading CSV file:
Notation Level RFResult PRIResult PDResult Total Result
AAA 1 1.23 0 2 3.23
AAA 1 3.4 1 0 4.4
BBB 2 0.26 1 1.42 2.68
BBB 2 0.73 1 1.3 3.03
CCC 3 0.30 0 2.73 3.03
DDD 4 0.25 1 1.50 2.75
AAA 5 0.25 1 1.50 2.75
FFF 6 0.26 1 1.42 2.68
...
...
Here is the code
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('home\NewFiles\Files.csv')
Notation = df['Notation']
Level = df['Level']
RFResult = df['RFResult']
PRIResult = df['PRIResult']
PDResult = df['PDResult']
fig, axes = plt.subplots(nrows=7, ncols=1)
ax1, ax2, ax3, ax4, ax5, ax6, ax7 = axes.flatten()
n_bins = 13
ax1.hist(df['Total'], n_bins, histtype='bar') # Currently this shows all Total Results in one plot
plt.show()
I want to show the Total Result of each Level on its own axes, as follows:
ax1 will show Level 1 Total Result
ax2 will show Level 2 Total Result
ax3 will show Level 3 Total Result
ax4 will show Level 4 Total Result
ax5 will show Level 5 Total Result
ax6 will show Level 6 Total Result
ax7 will show Level 7 Total Result

You can select a filtered part of a dataframe just by indexing: df[df['Level'] == level]['Total']. You can loop through the axes using for ax in axes.flatten(). To also get the index, use for ind, ax in enumerate(axes.flatten()). Note that Python normally starts counting from 0, so adding 1 to the index would be a good choice to indicate the level.
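As a small illustration of the filtering and the index offset, here is a sketch on a few rows shaped like the table in the question. The simplified column names and the placeholder list standing in for the axes are assumptions made only for this example.

```python
import pandas as pd

# A few rows shaped like the question's table (column names simplified)
df = pd.DataFrame({
    "Level": [1, 1, 2, 2, 3, 4, 5, 6],
    "Total": [3.23, 4.40, 2.68, 3.03, 3.03, 2.75, 2.75, 2.68],
})

# Hypothetical stand-ins for the Axes objects from plt.subplots()
placeholders = ["ax1", "ax2", "ax3"]

totals_by_level = {}
for ind, name in enumerate(placeholders):
    level = ind + 1  # enumerate() counts from 0, levels count from 1
    # Boolean indexing keeps only the rows of this level
    totals_by_level[level] = df[df["Level"] == level]["Total"].tolist()
```

Each entry of totals_by_level is what would be passed to ax.hist() for that level.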
Note that when you have backslashes in a string, you can use a raw string (r-string) so that they are not treated as escape characters: r'home\NewFiles\Files.csv'.
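To see what the r prefix does, the short sketch below compares the raw string with the same path written with doubled backslashes. In Python 3 the unprefixed literal would not even compile here, because \N starts a named Unicode escape.

```python
# Raw string: backslashes are kept literally
raw_path = r'home\NewFiles\Files.csv'

# Same path written with each backslash escaped by hand
escaped_path = 'home\\NewFiles\\Files.csv'

# Both spellings produce the identical string
same = raw_path == escaped_path

# The path contains exactly two backslash characters
backslashes = raw_path.count('\\')
```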
The default ylim is from 0 to the maximum bar height, plus some padding. This can be changed for each ax separately. In the example below a list of ymax values is used to show the principle.
ax.grid(True, axis='both') turns the grid on for that ax. Instead of 'both', 'x' or 'y' can also be used to set the grid for only that axis. A grid line is drawn for each tick value. (The example below tries to use little space, so only a few gridlines are visible.)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
N = 1000
df = pd.DataFrame({'Level': np.random.randint(1, 6, N), 'Total': np.random.uniform(1, 5, N)})
fig, axes = plt.subplots(nrows=5, ncols=1, sharex=True)
ymax_per_level = [27, 29, 28, 26, 27]
for ind, (ax, lev_ymax) in enumerate(zip(axes.flatten(), ymax_per_level)):
    level = ind + 1
    n_bins = 13
    ax.hist(df[df['Level'] == level]['Total'], bins=n_bins, histtype='bar')
    ax.set_ylabel(f'TL={level}') # to add the level in the ylabel
    ax.set_ylim(0, lev_ymax)
    ax.grid(True, axis='both')
plt.show()
PS: A stacked histogram with custom legend and custom vertical lines could be created as:
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import pandas as pd
import numpy as np
N = 1000
df = pd.DataFrame({'Level': np.random.randint(1, 6, N),
                   'RFResult': np.random.uniform(1, 5, N),
                   'PRIResult': np.random.uniform(1, 5, N),
                   'PDResult': np.random.uniform(1, 5, N)})
df['Total'] = df['RFResult'] + df['PRIResult'] + df['PDResult']
fig, axes = plt.subplots(nrows=5, ncols=1, sharex=True)
colors = ['crimson', 'limegreen', 'dodgerblue']
column_names = ['RFResult', 'PRIResult', 'PDResult']
level_vertical_line = [1, 2, 3, 4, 5]
for level, (ax, vertical_line) in enumerate(zip(axes.flatten(), level_vertical_line), start=1):
    n_bins = 13
    level_data = df[df['Level'] == level][column_names].to_numpy()
    # vertical_line = level_data.mean()
    ax.hist(level_data, bins=n_bins,
            histtype='bar', stacked=True, color=colors)
    ax.axvline(vertical_line, color='gold', ls=':', lw=2)
    ax.set_ylabel(f'TL={level}') # to add the level in the ylabel
    ax.margins(x=0.01)
    ax.grid(True, axis='both')
legend_handles = [Patch(color=color) for color in colors]
axes[0].legend(legend_handles, column_names, ncol=len(column_names), loc='lower center', bbox_to_anchor=(0.5, 1.02))
plt.show()

Hide x-axis labels in Matplotlib

I have used bar plot to display the following dataframe:
city pred actual
9 j 10.05 12.68
0 a 9.72 9.56
6 g 8.29 9.11
2 c 8.22 8.49
3 d 7.88 7.92
8 i 7.04 7.35
5 f 6.06 6.33
1 b 5.94 6.00
7 h 5.52 5.72
4 e 5.37 5.62
10 k 6.04 5.50
Code to plot:
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10, 7)
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'g']
df = df.sort_values(by=['actual'], ascending=False)
ax = df.plot(x="city", y=["actual", "pred"], kind="bar", color = colors, alpha=0.8)
plt.legend(["actual", "pred"], fontsize=15)
plt.gca().set_xticklabels(df['city'])
plt.suptitle("pred vs actual", fontsize=18)
for p in ax.patches:
    ax.annotate(np.round(p.get_height(), decimals=2),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.tight_layout()
plt.show()
Output:
What I'm trying to do is hide the unwanted city text label on the x axis. My expected output looks like this:
How can I do that? Thank you.

This line of code is the only one I have found that works:
ax.xaxis.label.set_visible(False)
If you have other solutions, feel free to share them.
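For comparison, here is a minimal sketch of that call next to one alternative, clearing the label text with set_xlabel(''). It uses a plain matplotlib bar chart with three of the rows from the question instead of the pandas plot, and the Agg backend is selected only so that the snippet runs without a display; both are assumptions of this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen so no display is needed
import matplotlib.pyplot as plt

# Three rows taken from the dataframe in the question
cities = ["j", "a", "g"]
actual = [12.68, 9.56, 9.11]

fig, ax = plt.subplots()
ax.bar(cities, actual)
ax.set_xlabel("city")  # mimics the axis label that df.plot(x="city") adds

# Option 1: keep the label text but make it invisible
ax.xaxis.label.set_visible(False)
label_hidden = not ax.xaxis.label.get_visible()

# Option 2: replace the label text with an empty string
ax.set_xlabel("")
label_text = ax.get_xlabel()
```

Both options only affect the axis label; the per-bar tick labels (the city names) are left alone.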

Can not plot a 3d bar use matplotlib

I want to plot a 3D bar chart using matplotlib.
I have a dataframe like this:
In[1]: mf
Out[1]: 1 2 4
0
6N 45.238806 104.102564 16.804965
12S 25.597015 95.128205 13.156028
18S 29.689055 76.730769 17.078014
7S 0.000000 156.602564 20.106383
12S 25.597015 95.128205 13.156028
25S 0.000000 151.217949 16.929078
2S 4.962687 49.358974 32.517730
14N 0.000000 0.000000 33.386525
24S 10.447761 71.346154 25.343972
I want to plot a 3D bar at the position corresponding to each value in the dataframe.
My code looks like this:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax1 = fig.add_subplot(111, projection='3d')
xpos = [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9]
ypos = [3,2,1,3,2,1,3,2,1,3,2,1,3,2,1,3,2,1,3,2,1,3,2,1,3,2,1]
zpos = np.zeros(27)
dx = np.ones(27)
dy = np.ones(27)
# to reshape my dataframe to a np vector
nf = mf.values
dz = np.reshape(nf,(1,27))
ax1.bar3d(xpos, ypos, zpos, dx,dy,dz, color="#00ceaa")
but I get this error:
15 dz = np.reshape(nf,(1,27))
16 dz
---> 17 ax1.bar3d(xpos, ypos, zpos, dx,dy,dz, color="#00ceaa")
bar3d(self, x, y, z, dx, dy, dz, color, zsort, shade, *args, **kwargs)
2526
2527 if shade:
-> 2528 normals = self._generate_normals(polys)
2529 sfacecolors = self._shade_colors(facecolors, normals)
in _generate_normals(self, polygons)
1771 v1 = np.array(verts[0]) - np.array(verts[1])
1772 v2 = np.array(verts[2]) - np.array(verts[0])
-> 1773 normals.append(np.cross(v1, v2))
1774 return normals
in cross(a, b, axisa, axisb, axisc, axis)
1716 "(dimension must be 2 or 3)")
1717 if a.shape[-1] not in (2, 3) or b.shape[-1] not in (2, 3):
-> 1718 raise ValueError(msg)
1719
1720 # Create the output array
ValueError: incompatible dimensions for cross product
(dimension must be 2 or 3)
Where is my code wrong? I have no idea. Thanks a lot.

You need to reshape your df.values like this:
dz = np.reshape(nf,(27))
so that all arrays have the same shape, i.e. (27,) (check dx.shape, dy.shape, dz.shape, ...).
Also note that (while not required) it's good practice to declare both your xpos and ypos lists as np.array, like:
xpos = np.array([1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9])
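The difference between the two reshape calls can be checked on a stand-in array. The sketch below assumes a 9 x 3 array of made-up numbers in place of mf.values; only the shapes matter here.

```python
import numpy as np

# Stand-in for mf.values: 9 rows x 3 columns (values are arbitrary)
nf = np.arange(27, dtype=float).reshape(9, 3)

# What the question used: a 2-D array with a single row
dz_2d = np.reshape(nf, (1, 27))

# What the answer suggests: a 1-D array of 27 heights
dz_1d = np.reshape(nf, 27)

# ravel() gives the same 1-D result without spelling out the length
dz_flat = nf.ravel()

# The other bar3d inputs from the question are 1-D with 27 entries
zpos = np.zeros(27)
```

dz_2d has shape (1, 27), while dz_1d, dz_flat and zpos all have shape (27,), which is the matching shape the answer asks for.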

How to ensure centroids of the clusters in k means algorithm doesn't switch everytime?

I have a csv file which looks like below
date mse
2018-02-11 14.34
2018-02-12 7.24
2018-02-13 244.5
2018-02-14 3.5
2018-02-16 12.67
2018-02-21 45.66
2018-02-22 15.33
2018-02-24 98.44
2018-02-26 23.55
2018-02-27 45.12
2018-02-28 78.44
2018-03-01 34.11
2018-03-05 23.33
2018-03-06 127.45
... ...
... ...
Now I apply k-means to the mse values to get 2 clusters, which gives me 2 centroids, one for each cluster. Then I am given an mse value and need to find which of the two centroids is nearer to it. I do something like this:
from sklearn.cluster import KMeans
import pandas as pd
centroids_list = []
result = []
given_mse = 7.382409087
kmeans = KMeans(n_clusters=2)
df = pd.read_csv("data.csv", parse_dates=["date"])
kmeans.fit_predict(df[['mse']])
centroids_list.append(kmeans.cluster_centers_.ravel())
#print(centroids_list) # [array([ 153.27996598, 19810.6925875 ])]
for i in centroids_list:
    # distance from the given value to each of the two centroids
    t1 = abs(given_mse - i[0])
    t2 = abs(given_mse - i[1])
    if t1 < t2:
        result.append("label 1")
    else:
        result.append("label 2")
print(result) # ['label 1']
As you can see, I get two centroid values, 153.27996598 and 19810.6925875, one for each cluster.
The problem is that the order of the values often switches [(x, y) or (y, x)] between runs of the program, so the end result is sometimes label 1 and sometimes label 2.
Any idea how this can be fixed? Is there any scikit-learn technique to prevent this switching?

As mentioned by @Vivek Kumar, I needed to pass an additional parameter, random_state, when creating the KMeans object. The value for random_state can be any integer; fixing it makes the result reproducible between runs.
kmeans = KMeans(n_clusters=2, random_state=1)
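A complementary idea, not from the answer above, is to sort the centroids before comparing, so that the "low" and "high" cluster keep the same position whatever the seed. The sketch below assumes the sample mse values from the question and n_init=10; the helper name nearest_cluster is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# mse values copied from the sample rows in the question
mse = np.array([14.34, 7.24, 244.5, 3.5, 12.67, 45.66, 15.33,
                98.44, 23.55, 45.12, 78.44, 34.11, 23.33, 127.45])
X = mse.reshape(-1, 1)
given_mse = 7.382409087

def nearest_cluster(seed):
    """Return 0 if given_mse is nearer the lower centroid, 1 otherwise."""
    # n_init=10 is an assumption; random_state is varied on purpose
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X)
    # Sorting removes the dependence on the arbitrary cluster numbering
    centers = np.sort(kmeans.cluster_centers_.ravel())
    return int(np.argmin(np.abs(centers - given_mse)))

# Different seeds may number the clusters differently,
# but the sorted position of the nearest centroid stays the same.
results = [nearest_cluster(seed) for seed in (0, 1, 2)]
```

If the fitted model is available, kmeans.predict([[given_mse]]) also returns the nearest cluster directly, although its numbering still depends on the fit.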

How to get value of each centroids in a k means cluster?

I have a csv file which looks like below
date mse
2018-02-11 14.34
2018-02-12 7.24
2018-02-13 244.5
2018-02-14 3.5
2018-02-16 12.67
2018-02-21 45.66
2018-02-22 15.33
2018-02-24 98.44
2018-02-26 23.55
2018-02-27 45.12
2018-02-28 78.44
2018-03-01 34.11
2018-03-05 23.33
2018-03-06 127.45
... ...
... ...
Now I want to get two clusters so that I know which values belong to which cluster, along with each cluster's mean.
K-means usually needs two parameters or sets of values. Since I am only concerned with the mse values and the clusters around them, I pass a range of the same length as the mse values as the other parameter. This is what I did:
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
df = pd.read_csv("generate_csv/all_data_device.csv", parse_dates=["date"])
df = df[df['mse'].values < 15000]
f1 = df['mse'].values
# generate another list of equal size
f2 = list(range(0, len(f1)))
X = np.array(list(zip(f1, f2)))
kmeans = KMeans(n_clusters=2).fit(X)
labels = kmeans.predict(X)
centroids = kmeans.cluster_centers_  # defined here because the centroid scatter below uses it
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X[:, 0], X[:, 1], c=labels)
ax.scatter(centroids[:, 0], centroids[:, 1], marker='*', c='#050505', s=1000)
plt.title('K Mean Classification (mse < 15000)')
plt.show()
This is what I get
Now I can get centroid coordinates by doing something like this
# Centroid coordinates
centroids = kmeans.cluster_centers_
print(centroids)
But I want the value of each of the centroids. In other words, since each centroid represents the mean of all the mse values in its cluster, I want this mean value for each cluster. How can I do it?
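One possible way to check the claim that a centroid is the mean of its cluster is sketched below. It clusters on the mse values alone (a single feature, without the extra range column) and uses the sample rows from the question; n_init=10 and random_state=0 are assumptions for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans

# mse values copied from the sample rows in the question
mse = np.array([14.34, 7.24, 244.5, 3.5, 12.67, 45.66, 15.33,
                98.44, 23.55, 45.12, 78.44, 34.11, 23.33, 127.45])

# Cluster on the single mse feature only
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(mse.reshape(-1, 1))
labels = kmeans.labels_

# With one feature, each centroid is a single number
centroid_values = kmeans.cluster_centers_.ravel()

# Mean of the mse values assigned to each cluster, computed by hand
cluster_means = np.array([mse[labels == k].mean() for k in range(2)])
```

With a single feature, centroid_values[k] and cluster_means[k] should agree up to floating-point error, so either can be read as the mean mse of cluster k.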
