Break ties between k-means clusters - statistics

I have a dataset to which I applied k-means with two clusters. If the distance from a particular point (x, y) to both cluster centres is the same, which cluster will the point go to? Please help me.
Thanks in advance.

tldr;
In the case of ties, k-means clustering will randomly assign the ambiguous point to a cluster. (This is based on R's implementation of k-means clustering kmeans.)
A specific example based on the iris data in R
Let's start by loading necessary R libraries
library(broom)
library(tidyverse)
For this example, we will use the Petal.Length and Petal.Width measurements from the iris dataset, and for simplicity exclude the "virginica" measurements so that the "setosa" and "versicolor" measurements form our two groups.
df <- iris %>%
  filter(Species != "virginica") %>%
  select(starts_with("Petal"), Species)
We now use k-means clustering with k = 2, and assign a cluster label to every (Petal.Length, Petal.Width) measurement; since the assignment of which group is "1" and which group is "2" is random, we use a fixed seed for reproducibility.
set.seed(2018)
kcl <- kmeans(df %>% select(-Species), 2)
df <- augment(kcl, df)
We show a scatterplot of Petal.Length vs. Petal.Width; the known Species labels are shown by the different colours and the inferred cluster association by the different symbols.
ggplot(df, aes(Petal.Length, Petal.Width, colour = Species)) +
  geom_point(aes(shape = .cluster), size = 3)
Let's manually calculate the within-cluster sum of squared pairwise distances; since we'll be needing this later as well, we'll create a function calculate_d.
calculate_d <- function(df) {
  df %>%
    select(.cluster, Petal.Length, Petal.Width) %>%
    group_by(.cluster) %>%
    nest() %>%
    mutate(dist = map_dbl(data, ~sum(as.matrix(dist(.x)^2)) / (2 * nrow(.x)))) %>%
    pull(dist)
}
calculate_d(df)
#[1] 2.0220 12.7362
Notice how the distances are identical to the within-cluster sum of squares (WCSS)
kcl$withinss
#[1] 2.0220 12.7362
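A short aside on why these numbers must agree: for any cluster C with centroid c, the squared pairwise distances and the within-cluster sum of squares are linked by

sum over i, j in C of norm(x_i - x_j)^2 = 2 * |C| * sum over i in C of norm(x_i - c)^2

so dividing the full, double-counted pairwise sum by 2 * |C|, exactly as calculate_d does, recovers kcl$withinss.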
Now let's add a new measurement that has the same Euclidean distance from both cluster centres: to do so, we choose the point that lies exactly half-way between the two centres on the straight line connecting them (equivalently, the point whose coordinates are the average of the two centres). A bit of basic trigonometry constructs that point:
z <- kcl$centers[2, ] - kcl$centers[1, ]
theta <- atan(z[2] / z[1])
dy <- sin(theta) * dist(kcl$centers) / 2
dx <- cos(theta) * dist(kcl$centers) / 2
x <- as.numeric(kcl$centers[1, 1] + dx)
y <- as.numeric(kcl$centers[1, 2] + dy)
We store our new point together with the two cluster centers in a new data.frame. The first two rows give the positions of cluster "1" and "2", and the third row contains our new point.
df2 <- bind_rows(as.data.frame(kcl$centers), c(Petal.Length = x, Petal.Width = y))
Let's show the new point together with the cluster centers on top of our (Petal.Length, Petal.Width) measurements.
ggplot(df, aes(Petal.Length, Petal.Width)) +
geom_point(aes(colour = Species, shape = .cluster), size = 3) +
geom_point(data = df2, aes(Petal.Length, Petal.Width), size = 4)
We confirm that the squared Euclidean distance between the new point and each cluster center is indeed the same; to do so we calculate the pairwise distances of our new point "3" to the cluster centres "1" and "2":
as.matrix(dist(df2))[, 3]
# 1 2 3
#1.4996 1.4996 0.0000
Now let's add our new point to the (Petal.Length, Petal.Width) measurements and calculate the within-cluster sum of squared pairwise distances, first assigning our new point to cluster "1" and then assigning it to cluster "2".
# Add new point and assign to cluster "1"
df.1 <- df %>%
  bind_rows(cbind.data.frame(
    Petal.Length = x,
    Petal.Width = y,
    Species = factor("setosa", levels = levels(df$Species)),
    .cluster = factor(1, levels = 1:2)))
calculate_d(df.1)
#[1] 4.226707 12.736200
# Add new point and assign to cluster "2"
df.2 <- df %>%
  bind_rows(cbind.data.frame(
    Petal.Length = x,
    Petal.Width = y,
    Species = factor("versicolor", levels = levels(df$Species)),
    .cluster = factor(2, levels = 1:2)))
calculate_d(df.2)
#[1] 2.02200 14.94091
Notice how the within-cluster sums of squared pairwise distances differ even though the new point has exactly the same distance from either cluster centre. However, notice also how the sum of the within-cluster squared pairwise distances is the same!
sum(calculate_d(df.1))
#[1] 16.96291
sum(calculate_d(df.2))
#[1] 16.96291
identical(sum(calculate_d(df.2)), sum(calculate_d(df.1)))
# [1] TRUE
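A short note on why the two totals must agree in this particular example (my own addition, easy to verify from the numbers above): adding a point p to a cluster with n members and centroid c increases that cluster's WCSS by

n / (n + 1) * norm(p - c)^2

Here both clusters contain 50 measurements and the new point is equidistant from both centres (distance 1.4996), so either assignment adds 50/51 * 1.4996^2 ≈ 2.2047 to the total, which matches both 4.2267 - 2.0220 and 14.9409 - 12.7362.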
To show that kmeans assigns the new point at random to either cluster we repeatedly cluster the data. To do so, we define a convenience function that returns the corresponding Species of the new point following k-means clustering.
kmeans_cluster_data <- function(df) {
  kcl <- kmeans(df %>% select(-Species), 2)
  df <- augment(kcl, df)
  map_cluster_to_Species <- df[1:(nrow(df) - 1), ] %>%
    count(Species, .cluster) %>%
    split(., .$.cluster)
  map_cluster_to_Species[[
    df[nrow(df), ] %>%
      pull(.cluster) %>%
      as.character()]]$Species %>% as.character()
}
We now repeatedly cluster the same data 100 times.
bind_cols(
  Iteration = 1:100,
  Species = map_chr(1:100, ~kmeans_cluster_data(df.1 %>% select(-.cluster)))) %>%
  ggplot(aes(Iteration, Species, group = 1)) +
  geom_line() +
  labs(title = "Assignment of new point to group")
Notice how the new point gets assigned to either Species group at random.

Related

Function to Convert Square Matrix to Upper Hessenberg with Similarity Transformations

I am attempting to translate a MATLAB function to Python from Timothy Sauer,
Numerical Analysis, Second Edition, page 546, Program 12.8. The original function
receives a square matrix and returns a matrix with the same eigenvalues but in
upper Hessenberg form. It creates Householder reflectors to produce zeros in the
off-diagonals of the matrix and performs similarity transformations on the original matrix to
bring it to upper Hessenberg form.
My Python translation succeeds in obtaining the eigenvalues only for 3x3 matrices,
but not for 4x4 matrices. Would anyone know the cause of the error? I pasted my code with success and failing cases below. Thank you.
import numpy as np
import math

norm = lambda v: math.sqrt(np.sum(v**2))

def upper_hessenberg(A):
    '''
    Translated from Timothy Sauer, Numerical Analysis Second Edition, page 546, Program 12.8
    Input: Square Matrix, A
    Output: B, a Similar Matrix with Same Eigenvalues as A except in Upper Hessenberg form
            V, a matrix containing the reflectors used to produce zeros in the off diagonals
    '''
    rows, columns = A.shape
    B = A[:, :].astype(float)  # will store the similar matrix
    V = np.zeros(shape=(rows, columns), dtype=float)  # will store the reflectors
    for column in range(columns - 2):  # start from the 1st column, end at the third to last column
        row = column
        x = B[row + 1:, column]  # decapitate the column
        reflection_of_x = np.zeros(len(x))  # first entry is the norm, followed by 0s
        if abs(norm(x)) <= np.finfo(float).eps:  # if there are already 0s in the off-diagonals, skip this column
            continue
        reflection_of_x[0] = norm(x)
        v = reflection_of_x - x  # v (the difference vector) represents the line connecting the original column to the reflection of the column (see Timothy Sauer, Numerical Analysis 2nd Edition, Figure 4.11, Householder reflector)
        v = v / norm(v)  # normalize to length of 1 (unit vector)
        V[:len(v), column] = v  # save the reflector in an upper triangular matrix called V
        # verify with x - 2 * (x @ v * v), which should equal a vector with all zeros except the leading entry
        column_projections = np.outer(v, v @ B[row + 1:, column:])  # project each col onto difference vector
        B[row + 1:, column:] = B[row + 1:, column:] - (2 * column_projections)
        row_projections = np.outer(v, B[row:, column + 1:] @ v).T  # project each row onto difference vector
        B[row:, column + 1:] = B[row:, column + 1:] - (2 * row_projections)
    return V, B
# Algorithm succeeds only with 3x3 matrices
eigvectors = np.array([
[1,3,2],
[4,5,6],
[7,8,9],
])
eigvalues = np.array([
[4,0,0],
[0,3,0],
[0,0,2]
])
M = eigvectors @ eigvalues @ np.linalg.inv(eigvectors)
print("The expected eigvals :", np.linalg.eigvals(M))
V,B = upper_hessenberg(M)
print("For 3x3 matrices, The function successfully produces these eigvals",np.linalg.eigvals(B))
#But with 4x4 matrices it fails
eigvectors = np.array([
[1,3,2,4],
[4,5,6,2],
[7,8,9,5],
[5,2,7,8]
])
eigvalues = np.array([
[4,0,0,0],
[0,3,0,0],
[0,0,2,0],
[0,0,0,1]
])
M = eigvectors @ eigvalues @ np.linalg.inv(eigvectors)
print("The expected eigvals :", np.linalg.eigvals(M))
V,B = upper_hessenberg(M)
print("For 4x4 matrices, The function fails to obtain correct eigvals",np.linalg.eigvals(B))
Your error is that you try to be too efficient. While the last rows are indeed increasingly reduced with leading zeros, this is not the case for the last columns. So in row_projections you need to remove the row limiter row:, i.e. change B[row:, column + 1:] to B[:, column + 1:].
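For concreteness, this is roughly what the corrected update could look like inside the loop, reusing the names from upper_hessenberg above (my reading of the fix, not code from the book):
# the right-multiplication must touch all rows, because the later columns of B
# are not yet reduced; only the column range stays restricted
row_projections = np.outer(v, B[:, column + 1:] @ v).T
B[:, column + 1:] = B[:, column + 1:] - (2 * row_projections)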
You are using the unstable variant of the "improved" Householder reflector. The older version would use the larger of x_refl - x and x_refl + x, by setting reflection_of_x[0] = -np.sign(x[0])*norm(x) (or by removing all the minus signs there).
The stable variant of the improved reflector would use the binomial trick in the normalization of x_refl - x if this difference becomes too small.
x_refl - x = [ norm(x) - x[0], -x[1:] ]
           = [ norm(x[1:])^2 / (norm(x) + x[0]), -x[1:] ]

                                  [ norm(x[1:]), -(norm(x) + x[0]) * (x[1:] / norm(x[1:])) ]
(x_refl - x) / norm(x_refl - x) = ----------------------------------------------------------
                                           sqrt(2 * norm(x) * (norm(x) + x[0]))
While the parts may have wildly different scales, no catastrophic cancellation happens for x[0]>0.
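As a sketch of how that stable construction could look in NumPy (my own illustration of the formula above, assuming x[0] >= 0 and a non-zero tail x[1:]; it is not code from the book or the question):
import numpy as np

def stable_reflector(x):
    # unit Householder vector v with (I - 2 v v^T) x = [norm(x), 0, ..., 0],
    # built from the cancellation-free rewriting of x_refl - x shown above
    nx = np.linalg.norm(x)
    nx_tail = np.linalg.norm(x[1:])
    v = np.empty_like(x, dtype=float)
    v[0] = nx_tail
    v[1:] = -(nx + x[0]) * (x[1:] / nx_tail)
    return v / np.sqrt(2 * nx * (nx + x[0]))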
See the discussion about the same algorithm from Golub/van Loan, 4th ed., for further details and opinions, and the code from that book.

Sort simmilarity matrix according to plot colors

I have this similarity matrix plot of some documents. I want to sort the values of the matrix, which is a numpy ndarray, to group the colors, while maintaining their relative position (the diagonal yellow line) and the labels as well.
path = "C:\\Users\\user\\Desktop\\texts\\dataset"
text_files = os.listdir(path)
#print (text_files)
tfidf_vectorizer = TfidfVectorizer()
documents = [open(f, encoding="utf-8").read() for f in text_files if f.endswith('.txt')]
sparse_matrix = tfidf_vectorizer.fit_transform(documents)
labels = []
for f in text_files:
if f.endswith('.txt'):
labels.append(f)
pairwise_similarity = sparse_matrix * sparse_matrix.T
pairwise_similarity_array = pairwise_similarity.toarray()
fig, ax = plt.subplots(figsize=(20,20))
cax = ax.matshow(pairwise_similarity_array, interpolation='spline16')
ax.grid(True)
plt.title('News articles similarity matrix')
plt.xticks(range(23), labels, rotation=90);
plt.yticks(range(23), labels);
fig.colorbar(cax, ticks=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
plt.show()
Here is one possibility.
The idea is to use the information in the similarity matrix and put elements next to each other if they are similar. If two items are similar, they should also be similar with respect to the other elements, i.e. have similar colors.
I start with the element which has the most in common with all other elements (this choice is a bit arbitrary) [a], and as the next element I choose, from the remaining elements, the one which is closest to the current one [b].
import numpy as np
import matplotlib.pyplot as plt
def create_dummy_sim_mat(n):
    sm = np.random.random((n, n))
    sm = (sm + sm.T) / 2
    sm[range(n), range(n)] = 1
    return sm

def argsort_sim_mat(sm):
    idx = [np.argmax(np.sum(sm, axis=1))]  # a
    for i in range(1, len(sm)):
        sm_i = sm[idx[-1]].copy()
        sm_i[idx] = -1
        idx.append(np.argmax(sm_i))  # b
    return np.array(idx)
n = 10
sim_mat = create_dummy_sim_mat(n=n)
idx = argsort_sim_mat(sim_mat)
sim_mat2 = sim_mat[idx, :][:, idx] # apply reordering for rows and columns
# Plot results
fig, ax = plt.subplots(1, 2)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat2)
def ticks(_ax, ti, la):
    _ax.set_xticks(ti)
    _ax.set_yticks(ti)
    _ax.set_xticklabels(la)
    _ax.set_yticklabels(la)

ticks(_ax=ax[0], ti=range(n), la=range(n))
ticks(_ax=ax[1], ti=range(n), la=idx)
After meTchaikovsky's answer I also tested my idea on a clustered similarity matrix (see first image). This method works, but is not perfect (see second image).
Because I use the similarity between two elements as an approximation of their similarity to all other elements, it is quite clear why this does not work perfectly.
So instead of using the initial similarity to sort the elements, one could calculate a second-order similarity matrix which measures how similar the similarities are (sorry).
This measure describes better what you are interested in: if two rows / columns have similar colors, they should be close to each other. The algorithm to sort the matrix is the same as before.
def add_cluster(sm, c=3):
    idx_cluster = np.array_split(np.random.permutation(np.arange(len(sm))), c)
    for ic in idx_cluster:
        cluster_noise = np.random.uniform(0.9, 1.0, (len(ic),) * 2)
        sm[ic[np.newaxis, :], ic[:, np.newaxis]] = cluster_noise

def get_sim_mat2(sm):
    return 1 / (np.linalg.norm(sm[:, np.newaxis] - sm[np.newaxis], axis=-1) + 1/n)
sim_mat = create_dummy_sim_mat(n=100)
add_cluster(sim_mat, c=4)
sim_mat2 = get_sim_mat2(sim_mat)
idx = argsort_sim_mat(sim_mat)
idx2 = argsort_sim_mat(sim_mat2)
sim_mat_sorted = sim_mat[idx, :][:, idx]
sim_mat_sorted2 = sim_mat[idx2, :][:, idx2]
# Plot results
fig, ax = plt.subplots(1, 3)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(sim_mat_sorted2)
The results with this second method are quite good (see third image), but I guess there exist cases where this approach also fails, so I would be happy about feedback.
Edit
I tried to explain it and also linked the ideas to the code with [a] and [b], but obviously I did not do a good job, so here is a second, more verbose explanation.
You have n elements and an n x n similarity matrix sm, where each cell (i, j) describes how similar element i is to element j. The goal is to order the rows / columns in such a way that one can see existing patterns in the similarity matrix. My idea to achieve this is really simple.
You start with an empty list and add elements one by one. The criterion for the next element is its similarity to the current element. If element i was added in the last step, I choose the element argmax(sm[i, :]) as the next one, ignoring the elements already added to the list. I ignore those elements by setting their values to -1.
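As a tiny sanity check of this greedy ordering (my own toy example, reusing the argsort_sim_mat defined above):
sm_demo = np.array([[1.0, 0.9, 0.1],
                    [0.9, 1.0, 0.2],
                    [0.1, 0.2, 1.0]])
# element 1 has the largest row sum, its closest remaining element is 0, then 2
print(argsort_sim_mat(sm_demo))  # -> [1 0 2]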
You can use the function ticks to reorder the labels:
labels = np.array(labels)  # make labels a numpy array, to index it with a list
ticks(_ax=ax[0], ti=range(n), la=labels[idx])
@scleronomic's solution is very elegant, but it also has one shortcoming, which is that we cannot set the number of clusters in the sorted correlation matrix. Assume we are working with a set of variables in which some of them are weakly correlated:
import string
import numpy as np
import pandas as pd
n_variables = 20
n_clusters = 10
n_samples = 100
np.random.seed(100)
names = list(string.ascii_lowercase)[:n_variables]
belongs_to_cluster = np.random.randint(0,n_clusters,n_variables)
latent = np.random.randn(n_clusters,n_samples)
variables = np.random.rand(n_variables,n_samples)
for ind in range(n_clusters):
    mask = belongs_to_cluster == ind
    # weakening the correlation
    if ind % 2 == 0:
        variables[mask] += latent[ind] * 0.1
    variables[mask] += latent[ind]
df = pd.DataFrame({key:val for key,val in zip(names,variables)})
corr_mat = np.array(df.corr())
As you can see, there are 10 clusters of variables by construction; however, variables within clusters that have an even index are weakly correlated. If we only want to see roughly 5 clusters in the sorted correlation matrix, we may need to find another way.
Based on this post, which is the accepted answer to the question "Clustering a correlation matrix", to sort a correlation matrix into blocks what we need to find are blocks where correlations within blocks are high and correlations between blocks are low. However, the solution provided by that accepted answer works best when we know how many blocks there are in the first place and, more importantly, when the sizes of the underlying blocks are the same, or at least similar. Therefore, I improved the solution with a new function sort_corr_mat:
def sort_corr_mat(corr_mat, clusters_guess):
    def _swap_rows(corr_mat, var1, var2):
        rs = corr_mat.copy()
        rs[var2, :], rs[var1, :] = corr_mat[var1, :], corr_mat[var2, :]
        cs = rs.copy()
        cs[:, var2], cs[:, var1] = rs[:, var1], rs[:, var2]
        return cs

    # analysis
    max_iter = 500
    best_score, current_score, best_count = -1e8, -1e8, 0
    num_minima_to_visit = 20
    best_corr = corr_mat
    best_ordering = np.arange(n_variables)
    for i in range(max_iter):
        for row1 in range(n_variables):
            for row2 in range(n_variables):
                if row1 == row2:
                    continue
                option_ordering = best_ordering.copy()
                option_ordering[row1], option_ordering[row2] = best_ordering[row2], best_ordering[row1]
                option_corr = _swap_rows(best_corr, row1, row2)
                option_score = score(option_corr, n_variables, clusters_guess)
                if option_score > best_score:
                    best_corr = option_corr
                    best_ordering = option_ordering
                    best_score = option_score
        if best_score > current_score:
            best_count += 1
            current_corr = best_corr
            current_ordering = best_ordering
            current_score = best_score
        if best_count >= num_minima_to_visit:
            return best_corr  # , best_ordering
    return best_corr  # , best_ordering
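The score function used above is not reproduced in this snippet; it comes from the linked "Clustering a correlation matrix" answer. As a rough, purely hypothetical stand-in (my assumption, not the original), one could reward orderings whose absolute correlations concentrate in clusters_guess equally sized diagonal blocks:
def score(corr_mat, n_variables, n_clusters):
    # hypothetical stand-in for the score() of the linked post:
    # sum of absolute correlations inside n_clusters equal diagonal blocks
    block_size = int(np.ceil(n_variables / n_clusters))
    total = 0.0
    for start in range(0, n_variables, block_size):
        stop = min(start + block_size, n_variables)
        total += np.abs(corr_mat[start:stop, start:stop]).sum()
    return total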
With sort_corr_mat and the corr_mat constructed in the first place, I compared the result obtained with my function (on the right) with that obtained with @scleronomic's solution (in the middle):
sim_mat_sorted = corr_mat[argsort_sim_mat(corr_mat), :][:, argsort_sim_mat(corr_mat)]
corr_mat_sorted = sort_corr_mat(corr_mat,clusters_guess=5)
# Plot results
fig, ax = plt.subplots(1,3,figsize=(18,6))
ax[0].imshow(corr_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(corr_mat_sorted)
Clearly, @scleronomic's solution works much better and faster, but my solution offers more control over the pattern of the output.

Found array with 0 feature(s) (shape=(268215, 0)) while a minimum of 1 is required by StandardScaler

I am solving a problem where I pull data for all ProductIDs and then iterate through the dataframe, looking at each unique ProductID to perform a set of functions.
Here, item is the ProductID/Item number:
#looping through the big dataframe to get a dataframe pertaining to the unique ID
for item in df2['Item Nbr'].unique():
    # fetch item data
    df = df2.loc[df2['Item Nbr'] == item]
And then I have a set of custom made python functions:
So, when I get through the first loop (for one ProductID) it all works great, but when it iterates through the loop and goes to the next ProductID, I am certain that the data it is pulling out is right, yet I get this error:
Found array with 0 feature(s) (shape=(268215, 0)) while a minimum of 1 is required by StandardScaler.
However, the X_train and y_train shapes are: (268215, 6) and (268215,)
Code Snippet : (Extra Information)
It is a huge file to show. But the initial big dataframe has
[362988 rows x 7 columns] - for first product and
[268215 rows x 7 columns] - for second product
Expansion of the code:
The big dataframe with the two unique product IDs:
biqQueryData = get_item_data(verbose=True)
Iterate over each unique product ID to extract the subset of the dataframe that pertains to it:
for item in biqQueryData['Item Nbr'].unique():
    df = biqQueryData.loc[biqQueryData['Item Nbr'] == item]
    try:
        df_model = model_all_stores(df, item, n_jobs=n_jobs,
                                    train_model=train_model,
                                    test_model=test_model,
                                    tune_model=tune_model,
                                    export_model=export_model,
                                    output=export_demand)
        # ... (remainder of the try/except block omitted in the original snippet)
The function model_all_stores:
def model_all_stores(df_raw, item_nbr, n_jobs=1, train_model=False,
                     test_model=False, export_model=False, output=False,
                     tune_model=False):
    """Models demand for specified item.

    Predict the demand of specified item for all stores. Does not
    filter for predict hidden demand (the function get_hidden_demand
    should be used for this.)

    Output: data frame output
    """
    # ML model hyperparameters
    impute_with = 'median'
    n_estimators = 100
    min_samples_split = 3
    min_samples_leaf = 3
    max_depth = None

    # load data and subset traited and valid
    dfnew = subset_traited_valid(df_raw)

    # get known demand
    df_ma = get_demand(dfnew)

    # impute missing sales data
    median_sales = df_ma['Sales Qty'].median()
    df_ma['Sales Qty'] = df_ma['Sales Qty'].fillna(median_sales)

    # add moving average features
    df_ma = df_ma.sort_values('Gregorian Days')
    window_list = [7 * x for x in [1, 2, 4, 8, 16, 52]]
    for w in window_list:
        grouped = df_ma.groupby('Store Nbr')['Sales Qty'].shift(1)
        rolling = grouped.rolling(window=w, min_periods=1).mean()
        df_ma['MA' + str(w)] = rolling.reset_index(0, drop=True)
    X_full = df_ma.loc[:, 'MA7':].values
    #print(X_full.shape)

    # use full data if not testing/tuning
    rows_for_model = df_ma['Known Demand'].notnull()
    X = df_ma.loc[rows_for_model, 'MA7':].values
    y = df_ma.loc[rows_for_model, 'Known Demand'].values
    X_train, y_train = X, y
    print(X_train.shape, y_train.shape)

    if train_model:
        # instantiate model components
        imputer = Imputer(missing_values='NaN', strategy=impute_with, axis=0)
        scale = StandardScaler()
        pca = PCA()
        forest = RandomForestRegressor(n_estimators=n_estimators,
                                       max_features='sqrt',
                                       min_samples_split=min_samples_split,
                                       min_samples_leaf=min_samples_leaf,
                                       max_depth=max_depth,
                                       criterion='mse',
                                       random_state=42,
                                       warm_start=True,
                                       n_jobs=n_jobs)

        # pipeline for model
        pipeline_steps = [('imputer', imputer),
                          ('scale', scale),
                          ('pca', pca),
                          ('forest', forest)]
        regr = Pipeline(pipeline_steps)
        regr.fit(X_train, y_train)
It fails here
Snippet Of data:
biqQueryData (the entire Dataframe)
364174,1084,2019-12-12,,,,0.0
.....
364174,1084,2019-12-13,,,,0.0
188880,397752,19421,2020-02-04,2.0,1.0,1.0,0.0
.....
188881,397752,19421,2020-02-05,2.0,1.0,1.0,0.0
Subset DF 1:
364174,1084,2019-12-12,,,,0.0
.....
364174,1084,2019-12-13,,,,0.0
Subset DF 2:
188880,397752,19421,2020-02-04,2.0,1.0,1.0,0.0
.....
188881,397752,19421,2020-02-05,2.0,1.0,1.0,0.0
Any help here would be great! Thank you

Data and X axis labels not align

Trying to plot x-axis (Event) values on their respective x-axis ticks. The y-axis is relative to the time of day when the event occurred and how long it lasted. The first label and data plotted are correct. However, the second set of data appears to skip over the major x-axis tick and is placed afterwards, but before the next major x-axis tick. This is repeated for each additional x-axis value plotted. The data does not show a problem with which x-axis tick it should appear on.
I defined the data (source) and can reproduce the issue with about 50 lines of code.
from bokeh.io import output_file
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show
from bokeh.models.formatters import NumeralTickFormatter
import pandas as pd
import math
output_file("events.html", mode="inline")
x1 = []
y1 = []
x2 = []
y2 = []
colorList = []
shortNames = []
nameAndId = ["Event1", 0]
x1.append(nameAndId)
y1.append(33470)
x2.append(nameAndId)
y2.append(33492)
colorList.append("red")
shortNames.append("Evt1")
nameAndId = ["Event2", 1]
x1.append(nameAndId)
y1.append(34116)
x2.append(nameAndId)
y2.append(34151)
colorList.append("green")
shortNames.append("Evt2")
xAxisLabels = ["Event1", "Event2"]
data = {"x1": x1, "y1": y1, "x2": x2, "y2": y2, "color": colorList,\
"shortName": shortNames}
eventDF = pd.DataFrame(data=data,
columns=("x1", "y1", "x2", "y2", "color",\
"shortName"))
source = ColumnDataSource(eventDF)
yRange = [34151, 33470]
p = figure(plot_width=700, plot_height=750, x_range=xAxisLabels,\
y_range=yRange, output_backend="webgl")
p.xaxis.major_label_orientation = math.pi / -2
p.segment(x0="x1",y0="y1",x1="x2",y1="y2", source=source, color="color"\
line_width=12)
p.yaxis[0].formatter = NumeralTickFormatter(format="00:00:00")
p.xaxis.axis_label = "Events"
labels = LabelSet(x="x2",y="y2", text="shortName", text_font_size="8pt"\
text_color="black", level="glyph", x_offset=-6,\
y_offset=-5, render_mode="canvas", angle=270,\
angle_units="deg", source=source)
p.add_layout(labels)
show(p)
I'm thinking this is something simple I've overlooked, like an x-axis formatter. I've tried to define one, but none seem to work for my use case. The data doesn't seem to be associated with the xAxisLabel. I expect Event1 to show on the first x-axis tick with Event2 on the second x-axis tick. Event1 is correct, but for each event afterwards, every major x-axis tick is skipped, with the data residing between tick marks.
The issue in your code is that the actual value for the x-coordinate you are supplying is:
nameAndId = ["Event2", 1]
This kind of list with a category name and a number in a list is understood by Bokeh as a categorical offset. You are explicitly telling Bokeh to position the glyph a distance of 1 (in "synthetic" coordinates) away from the location of "Event2". The reason things "work" for the Event1 case is that the offset in that case is 0:
nameAndId = ["Event1", 0]
I'm not sure what you are trying to accomplish by passing these lists with the second numerical value, so I can't really offer any additional suggestion except to say that it should probably not be passed on to Bokeh.
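If the numeric second element is not actually needed, a minimal change (my suggestion, not part of the original answer) is to append the bare category name instead of a [name, number] pair, e.g.:
# plain category name instead of a categorical offset like ["Event1", 0]
x1.append("Event1")
x2.append("Event1")
# ...and likewise "Event2" for the second event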

Recover elements from each cluster generated by scipy dendrogram

I'm building a dendrogram and truncating it to show only the largest 6 clusters. Also, the labeling is done via a simple leaf label function:
def llf(id):
    return str(id)

tree = sch.dendrogram(Z, truncate_mode='lastp',
                      leaf_label_func=llf, p=6, show_contracted=False,
                      show_leaf_counts=False, leaf_rotation=90,
                      no_labels=False, orientation='right')
My output looks like this:
My goal is to replace the non-descriptive labels of the leaves with the minimum value of the members within that cluster. For example, if the top leaf is the cluster that contains the range from 10 to 1000, then I would like to replace '2468' with 10. The actual logic to replace the ticks in the plot is easy to implement:
fig, ax = plt.subplots()
mislabels = ["foo" for i in range(7)]
ax.set_xticklabels(mislabels, fontsize=10, rotation=45)
Any ideas regarding how to extract the values from within the leaders?
So far I'm able to map each singleton leaf to its cluster using fcluster. However, that only maps my initial 1230 points to clusters; I need to map the point labeled as '2468' to its cluster, and I'm not sure how to do that.
Thanks!
I found a way to do it:
fig, ax = plt.subplots(2, 2, figsize=(10, 5))
ax = ax.ravel()

# [idx_plot[k]:, idx_plot[k]:]
for k, val in enumerate(linkages['ward']):
    cluster_local = cluster_labels[val]['ward'][6]
    leaders = sch.leaders(linkages['ward'][val], cluster_local)
    dates_labels = dict()
    for v, i in enumerate(leaders[1]):
        date_idx = np.where(cluster_local == i)
        dates_labels[leaders[0][v]] = (fechas[val][idx_plot[val]:][date_idx[0][0]].strftime('%y/%m'),
                                       fechas[val][idx_plot[val]:][date_idx[0][-1]].strftime('%y/%m'))
    mislabels = [dates_labels[leaders[0][i]][0] + ', ' + dates_labels[leaders[0][i]][1] for i in range(6)]
    yuca = sch.dendrogram(linkages['ward'][val], truncate_mode='lastp', ax=ax[k], leaf_label_func=llf, p=6,
                          show_contracted=False, show_leaf_counts=False,
                          leaf_rotation=0, no_labels=False, orientation='right')
    # ax[k].set_xticklabels(mislabels, fontsize=10, rotation=90)
    ax[k].set_yticklabels(mislabels, fontsize=10, rotation=0)
    ax[k].set_title(val)
plt.tight_layout()
plt.show()
