Clustering by assigning weights to the attributes - Excel

I have a data set in an Excel sheet that I need to cluster by assigning weights to the attributes. How can I do that?

You can define a function that computes the distance between two points by taking the attribute weights into account. An example of this is the weighted Euclidean distance.
Specifically, if there are k attributes for each point in your dataset and the corresponding weights are w1, w2, ..., wk, then the distance between two points X and Y is
d(X,Y) = sum(wi * (Xi - Yi)^2), i = 1, 2, ..., k, where Xi is the value of the ith attribute of point X.
If each weight is the inverse of the variance of its attribute, this reduces to a (diagonal) special case of the Mahalanobis distance:
http://en.wikipedia.org/wiki/Mahalanobis_distance
Once you define the distance function you can use K-means to cluster your data.
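For example, here is a minimal sketch in Python, assuming the sheet has been saved as a file named data.xlsx (a hypothetical name) and that the weights below are placeholders for your own. It exploits the fact that the weighted Euclidean distance equals the plain Euclidean distance after scaling each attribute by the square root of its weight, so standard K-means can be applied to the rescaled data.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical file name and weights: adjust to your own sheet and attributes.
df = pd.read_excel('data.xlsx')              # requires openpyxl for .xlsx files
weights = np.array([2.0, 1.0, 0.5])          # one weight per attribute column

# w_i * (x_i - y_i)^2 == (sqrt(w_i)*x_i - sqrt(w_i)*y_i)^2, so scaling each
# column by sqrt(w_i) turns weighted Euclidean K-means into plain K-means.
X = df.values * np.sqrt(weights)

labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
df['cluster'] = labels
print(df.head())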

Related

Principal Component Analysis (PCA) Explained Variance remains the same after changing dataframe column positions

I have a dataframe where A and B are used to predict C:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.decomposition import PCA

df = df[['A','B','C']]
array = df.values
X = array[:,0:-1]
Y = array[:,-1]
# Feature Importance
model = GradientBoostingClassifier()
model.fit(X, Y)
print ("Importance:")
print((model.feature_importances_)*100)
#PCA
pca = PCA(n_components=len(df.columns)-1)
fit = pca.fit(X)
print("Explained Variance")
print(fit.explained_variance_ratio_)
This prints
Importance:
[ 53.37975706 46.62024294]
Explained Variance
[ 0.98358394 0.01641606]
However, when I change the dataframe column order by swapping A and B, only the importance changes; the explained variance stays the same. Why did the explained variance not change to [0.01641606 0.98358394]?
df = df[['B','A','C']]
Importance:
[ 46.40771024 53.59228976]
Explained Variance
[ 0.98358394 0.01641606]
Explained variance does not refer to A or B or any columns of your dataframe. It refers to the principal components identified by the PCA, which are some linear combinations of the columns. These components are sorted in the order of decreasing variance as the documentation says:
components_ : array, shape (n_components, n_features)
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
explained_variance_ : array, shape (n_components,)
The amount of variance explained by each of the selected components.
Equal to n_components largest eigenvalues of the covariance matrix of X.
explained_variance_ratio_ : array, shape (n_components,)
Percentage of variance explained by each of the selected components.
So, the order of features does not affect the order of components returned. It does affect the array components_ which is a matrix that can be used to map principal components to the feature space.
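A quick sketch with made-up random data that reproduces this behaviour: swapping the feature columns permutes the columns of components_, but explained_variance_ratio_ is unchanged because it describes the components themselves, not the original features.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 2) * [10, 1]               # made-up data: feature 0 has much larger variance

pca_ab = PCA(n_components=2).fit(X)           # column order (A, B)
pca_ba = PCA(n_components=2).fit(X[:, ::-1])  # column order (B, A)

print(pca_ab.explained_variance_ratio_)       # e.g. [0.99..., 0.00...]
print(pca_ba.explained_variance_ratio_)       # identical ratios
print(pca_ab.components_)
print(pca_ba.components_)                     # same axes, with the feature columns swapped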

K-modes: calculating the distance between each point and its cluster centroid

I have a set of categorical variables to be clustered, so I am using k-modes from a GitHub package. I want to get the distance of each observation (point) to the centroid of the cluster it belongs to.
This is what I have implemented so far:
from kmodes import kmodes

kmodes_cao = kmodes.KModes(n_clusters=6, init='Cao', verbose=1)
kmodes_cao.fit_predict(data)
# Print cluster centroids of the trained model.
print('k-modes (Cao) centroids:')
print(kmodes_cao.cluster_centroids_)
# Print training statistics
print('Final training cost: {}'.format(kmodes_cao.cost_))
print('Training iterations: {}'.format(kmodes_cao.n_iter_))
I cannot use the Euclidean distance since the variables are categorical. What is the ideal way to calculate the distance of each point to its cluster centroid?
For example, suppose you have two variables: V1, which can take the values A or B, and V2, which can take C or D.
If your centroid is V1=A and V2=D, then for each variable i, count the cases where Vi != Ci (the centroid's value for variable i).
So if you have an instance with V1=A and V2=C, its distance from the centroid is 1.
It is a binary (matching) distance, as the small sketch below shows.
Hope that helps.
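A tiny sketch of that count in plain Python (the variable names are made up for illustration):

centroid = {'V1': 'A', 'V2': 'D'}
instance = {'V1': 'A', 'V2': 'C'}

# Matching (binary) distance: count the attributes that differ from the centroid.
distance = sum(instance[v] != centroid[v] for v in centroid)
print(distance)  # prints 1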
You can use the method matching_dissim() from the kmodes library.
It compares two rows of your dataset; one can be your centroid and the other any observation. First install the kmodes library, then import the method with this line:
from kmodes.util.dissim import matching_dissim
https://github.com/nicodv/kmodes/issues/39
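For example, here is a sketch with toy data (the exact behaviour of matching_dissim can vary between kmodes versions, so treat this as an assumption to verify against your installed version):

import numpy as np
from kmodes.kmodes import KModes
from kmodes.util.dissim import matching_dissim

# Toy categorical data; replace with your own array or DataFrame values.
data = np.array([['A', 'C'],
                 ['A', 'D'],
                 ['B', 'C'],
                 ['B', 'D']])

km = KModes(n_clusters=2, init='Cao', verbose=0)
labels = km.fit_predict(data)

# Compare each observation row-wise with the centroid of its own cluster:
# matching_dissim counts, per row, how many attributes differ.
dist_to_centroid = matching_dissim(data, km.cluster_centroids_[labels])
print(dist_to_centroid)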

scipy.cluster.hierarchy: labels seem not to be in the right order, and confused by the values on the vertical axis

I know that scipy.cluster.hierarchy is focused on dealing with distance matrices, but now I have a similarity matrix... After I plot it as a dendrogram, something weird happens.
Here is the code:
import numpy as np
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

similarityMatrix = np.array(([1,0.75,0.75,0,0,0,0],
[0.75,1,1,0.25,0,0,0],
[0.75,1,1,0.25,0,0,0],
[0,0.25,0.25,1,0.25,0.25,0],
[0,0,0,0.25,1,1,0.75],
[0,0,0,0.25,1,1,0.75],
[0,0,0,0,0.75,0.75,1]))
Here is the linkage call:
Z_sim = sch.linkage(similarityMatrix)
plt.figure(1)
plt.title('similarity')
sch.dendrogram(
Z_sim,
labels=['1','2','3','4','5','6','7']
)
plt.show()
But the resulting dendrogram does not look right.
My question is:
Why are the labels of this dendrogram not right?
I am giving a similarity matrix to the linkage method, but I cannot fully understand what the vertical axis means. For example, since the maximum similarity is 1, why is the maximum value on the vertical axis almost 1.6?
Thank you very much for your help!
linkage expects "distances", not "similarities". To convert your matrix to something like a distance matrix, you can subtract it from 1:
dist = 1 - similarityMatrix
linkage should not be given a square distance matrix. It expects the distance data to be in "condensed" form. You can get that using scipy.spatial.distance.squareform:
from scipy.spatial.distance import squareform
dist = 1 - similarityMatrix
condensed_dist = squareform(dist)
Z_sim = sch.linkage(condensed_dist)
(When you pass a two-dimensional array with shape (m, n) to linkage, it treats the rows as points in n-dimensional space, and computes the distances internally.)
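Putting it together, a sketch of the corrected script (linkage uses 'single' linkage by default; pick whichever method suits your data):

import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

similarityMatrix = np.array([[1, 0.75, 0.75, 0, 0, 0, 0],
                             [0.75, 1, 1, 0.25, 0, 0, 0],
                             [0.75, 1, 1, 0.25, 0, 0, 0],
                             [0, 0.25, 0.25, 1, 0.25, 0.25, 0],
                             [0, 0, 0, 0.25, 1, 1, 0.75],
                             [0, 0, 0, 0.25, 1, 1, 0.75],
                             [0, 0, 0, 0, 0.75, 0.75, 1]])

dist = 1 - similarityMatrix          # similarity -> distance; the diagonal becomes 0
condensed_dist = squareform(dist)    # square matrix -> condensed vector
Z_sim = sch.linkage(condensed_dist)

plt.figure()
plt.title('similarity')
sch.dendrogram(Z_sim, labels=['1', '2', '3', '4', '5', '6', '7'])
plt.show()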

Expectation Maximization algorithm (Gaussian Mixture Model): ValueError: the input matrix must be positive semidefinite

I am trying to implement the Expectation Maximization algorithm (Gaussian Mixture Model) on a data set data=[[x,y],...]. I am using the mv_norm.pdf(data, mean, cov) function to calculate cluster responsibilities. But after 6-7 iterations of recomputing the covariance (cov) matrices, a cov matrix becomes singular, i.e. its determinant is 0 (a very small value), and hence it gives the errors
ValueError: the input matrix must be positive semidefinite
and
raise np.linalg.LinAlgError('singular matrix')
Can someone suggest any solution for this?
import copy
from scipy.stats import multivariate_normal as mv_norm  # mv_norm is assumed to be scipy's multivariate_normal

#E-step: Compute cluster responsibilities, given cluster parameters
def calculate_cluster_responsibility(data, centroids, cov_m):
    pdfmain = [[] for i in range(0, len(data))]
    for i in range(0, len(data)):
        sum1 = 0
        pdfeach = [[] for m in range(0, len(centroids))]
        pdfeach[0] = 1/3. * mv_norm.pdf(data[i], mean=centroids[0], cov=[[cov_m[0][0][0], cov_m[0][0][1]], [cov_m[0][1][0], cov_m[0][1][1]]])
        pdfeach[1] = 1/3. * mv_norm.pdf(data[i], mean=centroids[1], cov=[[cov_m[1][0][0], cov_m[1][0][1]], [cov_m[1][1][0], cov_m[0][1][1]]])
        pdfeach[2] = 1/3. * mv_norm.pdf(data[i], mean=centroids[2], cov=[[cov_m[2][0][0], cov_m[2][0][1]], [cov_m[2][1][0], cov_m[2][1][1]]])
        sum1 += pdfeach[0] + pdfeach[1] + pdfeach[2]
        pdfeach[:] = [x / sum1 for x in pdfeach]
        pdfmain[i] = pdfeach
    global old_pdfmain
    if old_pdfmain == pdfmain:
        return
    old_pdfmain = copy.deepcopy(pdfmain)
    softcounts = [sum(i) for i in zip(*pdfmain)]
    calculate_cluster_weights(data, centroids, pdfmain, softcounts)
Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.
Can someone suggest any solution for this?
The problem is that your data lie on some manifold of dimension strictly smaller than the dimension of the input data. For example, your data might lie on a circle while you have 3-dimensional data. As a consequence, when your method tries to estimate a 3-dimensional ellipsoid (covariance matrix) that fits your data, it fails, since the optimal one is a 2-dimensional ellipse (the third dimension is 0).
How to fix it? You will need some regularization of your covariance estimator. There are many possible solutions, all in the M step, not the E step; the problem is with computing the covariance:
A simple solution: instead of doing something like cov = np.cov(X), add a regularizing term, such as cov = np.cov(X) + eps * np.identity(X.shape[1]) with a small eps (see the sketch after this list).
Use a nicer estimator, such as the LedoitWolf estimator from scikit-learn.
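A minimal sketch of the first option, applied to the per-cluster covariance update in the M step (the function name and arguments are made up for illustration):

import numpy as np

def regularized_cluster_cov(X, resp, mean, eps=1e-6):
    # Responsibility-weighted covariance for one cluster, plus a small ridge term
    # on the diagonal so the matrix stays positive definite.
    X = np.asarray(X, dtype=float)
    diff = X - mean                                      # (n_samples, n_features)
    cov = (resp[:, None] * diff).T @ diff / resp.sum()   # weighted covariance
    return cov + eps * np.identity(X.shape[1])

# e.g. inside your M step, for cluster j (hypothetical variable names):
# cov_m[j] = regularized_cluster_cov(data, responsibilities[:, j], centroids[j])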
Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.
This makes no sense: the covariance matrix values have nothing to do with the number of clusters. You can initialize it with anything more or less reasonable.

scikit-learn: projecting SVM weights of Principal Components to original image space

I did a PCA on my 3D image datasets, and used the first n PCs as my features in a linear SVM. I have SVM weights for each PC. Now I want to project the PC weights into the original image space to find which regions of the image were more discriminative in the classification process. I used the inverse_transform PCA method on the weight vector. However, the resulting image only has positive values, whereas the SVM weights were both positive and negative. This makes me wonder whether my approach is valid. Does anybody have any suggestions?
Thanks in advance.
I have a program that does this projection in image space. The thing to realise is that the weights themselves do not define the 'discrimination' weights (as also termed in this paper). You need the sum of the inputs (the support vectors) weighted by their kernel coefficients.
Consider this toy example:
Class A has 2 vectors: a1=(1,1) and a2=(2,2).
Class B has 2 vectors: b1=(2,4) and b2=(4,2).
If you draw this, you can construct the decision boundary by hand: it is the line of points (x,y) where x+y == 5. My SVM program finds the solution where w_a1 == 0 (not a support vector), w_a2 == -1, and w_b1 == w_b2 == 1/2, with bias == -5.
Now you can construct the projection vector p = a2*w_a2 + b1*w_b1 + b2*w_b2 = -1*(2,2) + 1/2*(2,4) + 1/2*(4,2) = (1,1).
In other words, every point should be projected onto the line y == x, and for a new vector v the inner product <v,p> is below 5 for class A vectors, and above 5 for class B vectors. You can centre the result around 0 by adding the bias.
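Here is a sketch of the same toy example with scikit-learn's linear SVC (a large C approximates a hard margin; the recovered numbers should be close to the hand solution above, but treat the exact values as approximate):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 4], [4, 2]], dtype=float)  # a1, a2, b1, b2
y = np.array([0, 0, 1, 1])                                   # class A = 0, class B = 1

clf = SVC(kernel='linear', C=1e6).fit(X, y)

# Projection vector = support vectors weighted by their dual (kernel) coefficients.
p = clf.dual_coef_ @ clf.support_vectors_
print(p)               # approximately [[1., 1.]]
print(clf.coef_)       # the same vector, exposed directly for the linear kernel
print(clf.intercept_)  # approximately [-5.]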
