How to get value of each centroids in a k means cluster? - python-3.x

I have a csv file which looks like below
date mse
2018-02-11 14.34
2018-02-12 7.24
2018-02-13 244.5
2018-02-14 3.5
2018-02-16 12.67
2018-02-21 45.66
2018-02-22 15.33
2018-02-24 98.44
2018-02-26 23.55
2018-02-27 45.12
2018-02-28 78.44
2018-03-01 34.11
2018-03-05 23.33
2018-03-06 127.45
... ...
... ...
Now I want to get two clusters, so that I know which values lie in which cluster and each cluster's mean.
KMeans usually needs two features (a set of value pairs). Since I am only concerned with the mse values and clustering around them, I pass as the second feature a range of the same length as the number of mse values. This is what I did:
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
df = pd.read_csv("generate_csv/all_data_device.csv", parse_dates=["date"])
df = df[df['mse'].values < 15000]
f1 = df['mse'].values
# generate another list of equal size to act as a second feature
f2 = list(range(0, len(f1)))
X = np.array(list(zip(f1, f2)))
kmeans = KMeans(n_clusters=2).fit(X)
labels = kmeans.predict(X)
centroids = kmeans.cluster_centers_  # needed below: one (mse, index) pair per cluster
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X[:, 0], X[:, 1], c=labels)
ax.scatter(centroids[:, 0], centroids[:, 1], marker='*', c='#050505', s=1000)
plt.title('K Mean Classification (mse < 15000)')
plt.show()
This is what I get
Now I can get centroid coordinates by doing something like this
# Centroid coordinates
centroids = kmeans.cluster_centers_
print(centroids)
But I want the value of each of the centroids. In other words, since each centroid represents the mean of all the mse values in its cluster, I want this mean value for each cluster. How can I do it?
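Since the second feature f2 is only a synthetic index, the first coordinate of each centroid is already the mean of the mse values in its cluster. A minimal sketch (reusing f1, labels and centroids from the code above) that reads the means off both ways:
# first coordinate of each centroid = mean mse of its cluster
print(centroids[:, 0])
# the same means, computed directly from the cluster assignments
for k in range(2):
    print(k, f1[labels == k].mean())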

Related

Python - Add new curve from a df into existing lineplot

I create a plot using sns based on a DataFrame.
Now, I would like to add a new curve from another DataFrame to the plot created previously.
This is the code of my plot:
tline = sns.lineplot(x='reads', y='time', data=df, hue='method', style='method', markers=True, dashes=False, ax=axs[0, 0])
tline.set_xlabel('Numero di reads')
tline.set_ylabel('Time [s]')
tline.legend(loc='lower right')
tline.set_yscale('log')
tline.autoscale(enable=True, axis='x')
tline.autoscale(enable=True, axis='y')
Now I have another DataFrame with the same columns as the first DataFrame. How can I add this new curve, with a custom entry in the legend?
This is the structure of the DataFrame:
Dataset  Method  Reads     Time   Peak-memory
14M      Set     14000000  7.33   1035204
20K      Set     200000    0.38   107464
200K     Set     20000     0.07   42936
2M       Set     28428648  16.09  2347740
28M      Set     2000000   1.41   240240
I suggest using matplotlib's OOP interface, like this:
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns
# generate sample data
time_column = np.arange(10)
data_column1 = np.random.randint(0, 10, 10)
data_column2 = np.random.randint(0, 10, 10)
# store in pandas DataFrames
df1 = pd.DataFrame(list(zip(time_column, data_column1)), columns=['Time', 'Data'])
df2 = pd.DataFrame(list(zip(time_column, data_column2)), columns=['Time', 'Data'])
# draw both curves on the same Axes; the label argument sets the legend entry
f, ax = plt.subplots()
sns.lineplot(x='Time', y='Data', data=df1, label='foo', ax=ax)
sns.lineplot(x='Time', y='Data', data=df2, label='bar', ax=ax)
ax.legend()
plt.show()
which generates the following output.
The important thing is that both lineplots are drawn on the same subplot (ax in this case).
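Applied to the question's plot, a minimal sketch along the same lines (df_new is a hypothetical name for the second DataFrame, assumed to have the same 'reads' and 'time' columns as df) would be:
# the label argument supplies the custom legend entry
sns.lineplot(x='reads', y='time', data=df_new, label='my new curve', ax=tline)
tline.legend(loc='lower right')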

How can I create a forest plot?

I would like to combine different risk ratios into one forest plot. I would expect the output to be similar to metamiss in STATA or metafor in R. How can I do this in Python?
Using the zEPID package, I created a forest plot of different risk ratios.
import matplotlib.image as mpimg
import numpy as np
import matplotlib.pyplot as plt
import zepid
from zepid.graphics import EffectMeasurePlot
labs = ["ACA(Isq=41.37% Tausq=0.146 pvalue=0.039 )",
"ICA0(Isq=25.75% Tausq=0.092 pvalue=0.16 )",
"ICA1(Isq=60.34% Tausq=0.121 pvalue=0.00 )",
"ICAb(Isq=25.94% Tausq=0.083 pvalue=0.16 )",
"ICAw(Isq=74.22% Tausq=0.465 pvalue=0.00 )"]
measure = [2.09,2.24,1.79,2.71,1.97]
lower = [1.49,1.63,1.33,2.00,1.25]
upper = [2.92,3.07,2.42,3.66,3.11]
p = EffectMeasurePlot(label=labs, effect_measure=measure, lcl=lower, ucl=upper)
p.labels(effectmeasure='RR')
p.colors(pointshape="D")
ax=p.plot(figsize=(7,3), t_adjuster=0.09, max_value=4, min_value=0.35 )
plt.title("Random Effect Model(Risk Ratio)",loc="right",x=1, y=1.045)
plt.suptitle("Missing Data Imputation Method",x=-0.1,y=0.98)
ax.set_xlabel("Favours Control Favours Haloperidol ", fontsize=10)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(True)
ax.spines['left'].set_visible(False)
plt.savefig("Missing Data Imputation Method",bbox_inches='tight')
The statsmodels library has an API for doing simple meta-analysis and plotting forest plots. It supports the DerSimonian-Laird (chi2) and Paule-Mandel (iterated) random-effects estimators. See the statsmodels docs for more use cases, options and examples.
An example from their docs:
import numpy as np
from statsmodels.stats.meta_analysis import combine_effects
# dummy data
mean_effect = np.array([61.00,61.40,62.21,62.30,62.34,62.60,62.70,62.84,65.90])
var_effect = np.array([0.2025,1.2100,0.0900,0.2025,0.3844,0.5625,0.0676,0.0225,1.8225])
idx = ['lab1','lab2','lab3','lab4','lab5','lab6','lab7','lab8','lab9']
# meta-analysis and forest plot
results = combine_effects(mean_effect, var_effect, method_re="chi2", use_t=True, row_names=idx)
print(results.summary_frame())
fig = results.plot_forest()
Output:
eff sd_eff ci_low ci_upp w_fe w_re
lab1 61.000000 0.450000 60.118016 61.881984 0.057436 0.123113
lab2 61.400000 1.100000 59.244040 63.555960 0.009612 0.040314
lab3 62.210000 0.300000 61.622011 62.797989 0.129230 0.159749
lab4 62.300000 0.450000 61.418016 63.181984 0.057436 0.123113
lab5 62.340000 0.620000 61.124822 63.555178 0.030257 0.089810
lab6 62.600000 0.750000 61.130027 64.069973 0.020677 0.071005
lab7 62.700000 0.260000 62.190409 63.209591 0.172052 0.169810
lab8 62.840000 0.150000 62.546005 63.133995 0.516920 0.194471
lab9 65.900000 1.350000 63.254049 68.545951 0.006382 0.028615
fixed effect 62.583397 0.107846 62.334704 62.832090 1.000000 NaN
random effect 62.390139 0.245750 61.823439 62.956838 NaN 1.000000
fixed effect wls 62.583397 0.189889 62.145512 63.021282 1.000000 NaN
random effect wls 62.390139 0.294776 61.710384 63.069893 NaN 1.000000
I’d also recommend having a read through the docs for the individual methods such as combine_effects() which contains additional notes and references regarding the implementation.
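For instance, a minimal sketch of switching to the Paule-Mandel estimator mentioned above (reusing mean_effect, var_effect and idx from the example):
# Paule-Mandel ("iterated") random-effects estimator instead of DerSimonian-Laird
results_pm = combine_effects(mean_effect, var_effect, method_re="iterated", use_t=True, row_names=idx)
print(results_pm.summary_frame())
fig = results_pm.plot_forest()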
Since I haven't found a customizable package to create a forest plot, I developed myforestplot for that purpose.
The following is one example of a forest plot using the Titanic dataset.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import matplotlib.pyplot as plt
import myforestplot as mfp

data = (pd.read_csv("titanic.csv")
        [["survived", "pclass", "sex", "age", "embark_town"]]
        .dropna()
        )
ser = data["age"]
data["age"] = (ser
               .mask(ser >= 40, "40 or more")
               .mask(ser < 40, "20_39")
               .mask(ser < 20, "0_19")
               )
res = smf.logit("survived ~ sex + age + embark_town", data=data).fit()
order = ["age", "sex", "embark_town"]
cont_cols = []
item_order = {"embark_town": ['Southampton', 'Cherbourg', 'Queenstown'],
              "age": ["0_19", "20_39", "40 or more"],
              }
df = mfp.statsmodels_pretty_result_dataframe(data, res,
                                             order=order,
                                             cont_cols=cont_cols,
                                             item_order=item_order,
                                             fml=".3f",
                                             )
df is a dataframe for creating a forest plot.
plt.rcParams["font.size"] = 8
fp = mfp.SimpleForestPlot(ratio=(8, 3), dpi=150, figsize=(5, 3), df=df,
                          vertical_align=True)
fp.errorbar(errorbar_kwds=None, log_scale=True)
xticklabels = [0.1, 0.5, 1.0, 2.0]
fp.ax2.set_xlim(np.log([0.1, 1.5]))
fp.ax2.set_xticks(np.log(xticklabels))
fp.ax2.set_xticklabels(xticklabels)
fp.ax2.set_xlabel("OR (log scale)")
fp.ax2.axvline(x=0, ymin=0, ymax=1.0, color="black", alpha=0.5)
fp.ax1.set_xlim([0.35, 1])
fp.embed_cate_strings("category", 0.3, header="Category",
                      text_kwds=dict(fontweight="bold"),
                      header_kwds=dict(fontweight="bold")
                      )
fp.embed_strings("item", 0.36, header="", replace={"age": ""})
fp.embed_strings("nobs", 0.60, header="N")
fp.embed_strings("risk_pretty", 0.72, header="OR (95% CI)")
fp.horizontal_variable_separators()
fp.draw_outer_marker(log_scale=True, scale=0.008)
plt.show()
and we obtain the figure (a forest plot image).

Local Outlier Factor only calculated for some points (scikit-learn)

I have a large csv file containing 2 columns that represent the result of k-means clustering. I calculated 11 centroids, and the csv file contains, for each point, which centroid is closest and the point's distance to that centroid.
The entries look like:
K11-closest,K11-distance
0,31544.821603570384
0,31494.23348984612
0,31766.471900874752
0,31710.896696452823
Then I want to calculate and plot the LOF using a script I found on scikit-learn.org
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
dataset = pd.read_csv('0.csv')
clf = LocalOutlierFactor(n_neighbors=20)
# use fit_predict to compute the predicted labels of the training samples
# (when LOF is used for outlier detection, the estimator has no predict,
# decision_function and score_samples methods).
y_pred = clf.fit_predict(dataset)
X_scores = clf.negative_outlier_factor_
plt.title("Local Outlier Factor (LOF)")
plt.scatter(dataset.iloc[:, 0], dataset.iloc[:, 1], color='k', s=3., label='Data points')
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
plt.scatter(dataset.iloc[:, 0].values, dataset.iloc[:, 1].values, s=50 * radius, edgecolors='r',
facecolors='none', label='Outlier scores')
plt.show()
But the plot shows:
With the black points being the data points and the red circles showing how much of an outlier each point is.
So I assume the LOF is not calculated for every point. But why? And how can I calculate it for every point and make it visible in the plot?
Normalising the data will help you make more readable graphs. Also note the radius multiplier: your code uses 50, while I have used 1000.
As we can see, the algorithm does not draw a red circle for every data point; the result also depends on how many nearest neighbours (n_neighbors) the algorithm takes into account when marking the circles.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
dataset = pd.DataFrame(data=[[0, 31544.821603570384], [0, 31494.23348984612],
                             [0, 31766.471900874752], [0, 31710.896696452823]],
                       columns=["K11-closest", "K11-distance"])
dataset = scaler.fit_transform(dataset)
clf = LocalOutlierFactor(n_neighbors=3)
y_pred = clf.fit_predict(dataset)
X_scores = clf.negative_outlier_factor_
plt.title("Local Outlier Factor (LOF)")
plt.scatter(dataset[:, 0], dataset[:, 1], color='k', s=3., label='Data points')
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
plt.scatter(dataset[:, 0], dataset[:, 1], s=1000 * radius, edgecolors='r',
            facecolors='none', label='Outlier scores')
legend = plt.legend(loc='upper left')
legend.legendHandles[0]._sizes = [10]
legend.legendHandles[1]._sizes = [20]
plt.show()
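Note that LOF is in fact computed for every sample; the circles for inliers are simply too small to see at a small radius multiplier. The per-point scores all live in negative_outlier_factor_, so as a quick sketch (reusing dataset and X_scores from above) you can print them to verify:
# scores near -1 are inliers; much more negative values are outliers
for point, score in zip(dataset, X_scores):
    print(point, score)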

How to plot the output of k-means clustering of word embeddings using Python?

I have used gensim's word embeddings to find vectors for each word. Then I used k-means to find clusters of words. There are close to 10,000 tokens/words and I want to plot them.
I want to plot the result in the following way:
Annotate points with the names of the words
Different colors for the clusters
Here is what I have done.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(perplexity=40, n_components=2, init='pca', n_iter=500)  # , random_state=13

def tsne_plot(data):
    "Creates a TSNE model and plots it"
    data = data.sample(n=500).reset_index()
    word = data["word"]
    cluster = data["clusters"]
    data = data.drop(["clusters", "word"], axis=1)
    X = tsne.fit_transform(data)
    plt.figure(figsize=(48, 48))
    for i in range(len(X)):
        plt.scatter(X[:, 0][i], X[:, 1][i], c=cluster[i])
        plt.annotate(word[i],
                     xy=(X[:, 0][i], X[:, 1][i]),
                     xytext=(3, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

tsne_plot(data)
Though it's annotating the words, it is failing to color the different groups/clusters.
Is there any other approach which annotates with word names and colors the different clusters?
This is how it's typically done, with annotations and rainbow colors.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline
from sklearn.cluster import KMeans
import seaborn as sns
X = np.array([[5, 3],
              [10, 15],
              [15, 12],
              [24, 10],
              [30, 45],
              [85, 70],
              [71, 80],
              [60, 78],
              [55, 52],
              [80, 91]])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
print(kmeans.cluster_centers_)
print(kmeans.labels_)
# plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='rainbow')
data = X
labels = kmeans.labels_
#######################################################################
plt.subplots_adjust(bottom=0.1)
plt.scatter(data[:, 0], data[:, 1], c=kmeans.labels_, cmap='rainbow')
for label, x, y in zip(labels, data[:, 0], data[:, 1]):
    plt.annotate(
        label,
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='red', alpha=0.5),
        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
plt.show()
#######################################################################
See the link below for all the details:
https://stackabuse.com/k-means-clustering-with-scikit-learn/
See the link below for some samples of how to do annotations with characters rather than numbers:
https://nikkimarinsek.com/blog/7-ways-to-label-a-cluster-plot-python
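As a minimal sketch of combining both ideas for the original use case (words is a hypothetical list of tokens standing in for the asker's vocabulary, parallel to the rows of data), the color comes from the cluster label via cmap and the annotation text is the word itself:
words = ['cat', 'dog', 'car', 'bus']  # hypothetical tokens, one per data row shown
plt.scatter(data[:4, 0], data[:4, 1], c=labels[:4], cmap='rainbow')
for word, x, y in zip(words, data[:4, 0], data[:4, 1]):
    plt.annotate(word, xy=(x, y), xytext=(3, 2), textcoords='offset points')
plt.show()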

How to ensure centroids of the clusters in the k-means algorithm don't switch every time?

I have a csv file which looks like below
date mse
2018-02-11 14.34
2018-02-12 7.24
2018-02-13 244.5
2018-02-14 3.5
2018-02-16 12.67
2018-02-21 45.66
2018-02-22 15.33
2018-02-24 98.44
2018-02-26 23.55
2018-02-27 45.12
2018-02-28 78.44
2018-03-01 34.11
2018-03-05 23.33
2018-03-06 127.45
... ...
... ...
Now I try to apply k-means to the mse values to get 2 clusters, which gives me 2 centroids, one for each. Now I am given an mse value and I need to find which of the two centroids is nearer to it. I do something like this:
from sklearn.cluster import KMeans
import pandas as pd

centroids_list = []
result = []
given_mse = 7.382409087
kmeans = KMeans(n_clusters=2)
df = pd.read_csv("data.csv", parse_dates=["date"])
kmeans.fit_predict(df[['mse']])
centroids_list.append(kmeans.cluster_centers_.ravel())
# print(centroids_list)  # [array([ 153.27996598, 19810.6925875 ])]
for i in centroids_list:
    t1 = abs(given_mse - i[0])
    t2 = abs(given_mse - i[1])
    if t1 < t2:
        result.append("label 1")
    else:
        result.append("label 2")
print(result)  # ['label 1']
Now as you can see, I get two centroid values, 153.27996598 and 19810.6925875, one assigned to each cluster.
The problem is that the values keep switching [(x, y) or (y, x)] between runs of the program, because of which I get the end result as either label 1 or at times label 2.
Any idea how this can be fixed? Is there any scikit-learn technique to prevent this switching?
As mentioned by @Vivek Kumar, I needed to pass an additional parameter, random_state, when setting up the k-means. The value for random_state can be any integer.
kmeans = KMeans(n_clusters=2, random_state=1)
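random_state only pins one particular (still arbitrary) ordering. As a sketch of an alternative that makes the labeling deterministic by construction, you can sort the centroid values before comparing, so that label 1 always refers to the smaller centroid:
import numpy as np
centroids = np.sort(kmeans.cluster_centers_.ravel())  # ascending: centroids[0] <= centroids[1]
label = "label 1" if abs(given_mse - centroids[0]) < abs(given_mse - centroids[1]) else "label 2"
print(label)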
