I have been trying to cluster a set of text documents. I have a sparse TFIDF matrix with around 10k documents (subset of a large dataset), and I try to run the scikit-learn k-means algorithm with different sizes of clusters (10,50,100). Rest all the parameters are default values.
I get a very strange behavior that no matter how many clusters I specify or even if I change the number of iterations, there would be 1 cluster in the lot which would contain most of the documents in itself and there will be many clusters which would have just 1 document in them. This is highly non-uniform behavior
Does anyone know what kind of problem am I running into?
Here are the possible things that might be going "wrong":
Your k-means cluster initialization points are chosen as the same set of points in each run. I recommend using the 'random' for the init parameter of k-means http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html. If that doesn't work then supply to k-means your own set of random initial cluster centers. Remember to initialize your random generator using its seed() method as the current date and time. https://docs.python.org/2/library/random.html uses current date-time as the default value.
Your distance function, i.e. euclidean distance might be the culprit. This is less likely but it is always good to run k-means using cosine similarity especially when you are using it for document similarity. scikits doesn't have this functionality at present but you should look here: Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
These two combined should give you good clusters.
I noticed with the help of above answers and comments that there was a problem with outliers and noise in original space. For this, we should use a dimensionality reduction method which eliminates the unwanted noise in the data. I tried random projections first but it failed to work with text data, simply because the problem was still not solved.
Then using Truncated Singular Value Decomposition, I was able to get perfect uniform clusters. Hence, the Truncated SVD is the way to go with textual data in my opinion.
Related
I have a dataset of 590000 records after preprocessing and i wanted to find clusters out of it and it contains string data (for now assume i have only one column with 590000 unique values in dataset). Also i am using custom defined distance measure and needed to calculate the distance matrix of size 590000*590000. Using some partition logic i created the distance matrix but cannot merge those partitions into one big distance matrix due to memory constarints. Does anyone have any sort of idea to resolve it ?? I picked DBSCAN for it. Is there any way to use deep learning methodologies?? any other ideas
Use a manageable sample first.
Because I doubt the results will be good enough to warrant any effort on scaling a method that does not work anyway.
i got a facebook-list of user-ids from following page:
Stanford Facebook-Data
If you look at the facebook_combined data, you can see that it is a list of user-connections (edges). So for instance user 0 has something to do with user 1,2,3 and so on.
Now my work is to find clusters in the dataset.
In the first step i used node.js to read the file and save the data in an array like this:
array=[[0,1],[0,2], ...]
In the second step i used a k-means plugin for node.js to cluster the data:
Cluster-Plugin
But i dont know if the result is right, because now i get clusters of edges and not clusters of users.
UPDATE:
I am trying out a markov implementation for node js. The Markov Plugin however needs an adjacency matrix to build clusters. I implemented an algorithm with java to save the matrix in a file.
Maybe you got any other suggestion how i could get clusters out of edges.
K-means assumes your input data issue an R^d vector space.
In fact, it requires the data to be this way, because it computes means as cluster centers, hence the name k-means.
So if you want to use k-means, then you need
One row per datapoint (not an edge list)
A fixed dimensionality data space where the mean is a useful center (usually, you should have continuous attributes, on binary data the mean does not make too much sense) and where least-squares is a meaningful optimization criterion (again, on binary data, least-squares does not have a strong theoretical support)
On your Faceboook data, you could try some embedding, but I'd have doubts about the trustworthiness.
Spark now has two machine learning libraries - Spark MLlib and Spark ML. They do somewhat overlap in what is implemented, but as I understand (as a person new to the whole Spark ecosystem) Spark ML is the way to go and MLlib is still around mostly for backward compatibility.
My question is very concrete and related to PCA. In MLlib implementation there seems to be a limitation of the number of columns
spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors.
Also, if you look at the Java code example there is also this
The number of columns should be small, e.g, less than 1000.
On the other hand, if you look at ML documentation, there are no limitations mentioned.
So, my question is - does this limitation also exists in Spark ML? And if so, why the limitation and is there any workaround to be able to use this implementation even if the number of columns is large?
PCA consists in finding a set of decorrelated random variables that you can represent your data with, sorted in decreasing order with respect to the amount of variance they retain.
These variables can be found by projecting your data points onto a specific orthogonal subspace. If your (mean-centered) data matrix is X, this subspace is comprised of the eigenvectors of X^T X.
When X is large, say of dimensions n x d, you can compute X^T X by computing the outer product of each row of the matrix by itself, then adding all the results up. This is of course amenable to a simple map-reduce procedure if d is small, no matter how large n is. That's because the outer product of each row by itself is a d x d matrix, which will have to be manipulated in main memory by each worker. That's why you might run into trouble when handling many columns.
If the number of columns is large (and the number of rows not so much so) you can indeed compute PCA. Just compute the SVD of your (mean-centered) transposed data matrix and multiply it by the resulting eigenvectors and the inverse of the diagonal matrix of eigenvalues. There's your orthogonal subspace.
Bottom line: if the spark.ml implementation follows the first approach every time, then the limitation should be the same. If they check the dimensions of the input dataset to decide whether they should go for the second approach, then you won't have problems dealing with large numbers of columns if the number of rows is small.
Regardless of that, the limit is imposed by how much memory your workers have, so perhaps they let users hit the ceiling by themselves, rather than suggesting a limitation that may not apply for some. That might be the reason why they decided not to mention the limitation in the new docs.
Update: The source code reveals that they do take the first approach every time, regardless of the dimensionality of the input. The actual limit is 65535, and at 10,000 they issue a warning.
I have more than 10^8 records stored in elasticSearch. Now I want to clustering them by writing a hierarchical algorithm or using PIC based on spark MLlib.
However, I can't use some efficient algorithm like K-means because every record is stored in the form of
{mainID:[subId1,subId2,subId3,...]}
which obviously is not in euclidean space.
I need to calculate the distance of every pair of records which will take a very LONG time I guess (10^8 * 10^8). I know the cartesian product in spark to do such computing , but there will appear the duplicated ones like (mainID1,mainID2) and (mainID2,mainID1), which is not suitable to PIC.
Does anyone know a better way to cluster these records? Or any method to delete the duplicated ones in the result RDD of cartesian product?
Thanks A lot!
First of all, don't take the full Cartesian product:
select where a.MainID > b.MainID
This doesn't reduce the complexity, but it does save about 2x in generation time.
That said, consider your data "shape" and select the clustering algorithm accordingly. K-means, HC, and PIC have three different applications. You know K-means already, I'm sure.
PIC basically finds gaps in the distribution of distances. It's great for well-defined sets (clear boundaries), even when those curl around each other or nest. However, if you have a tendril of connecting points (like a dumbbell with a long, thin bar), PIC will not separate the obvious clusters.
HC is great for such sets, and is a good algorithm in general. Most HC algorithms have an "understanding" of density, and tend to give clusterings that fit human cognition's interpretation. However, HC tends to be slow.
I strongly suggest that you consider a "seeded" algorithm: pick a random subset of your points, perhaps
sqrt(size) * dim
points, where size is the quantity of points (10^8) and dim is the number of dimensions. For instance, your example has 5 dimensions, so take 5*10^4 randomly selected points. Run the first iterations on those alone, which will identify centroids (K-means), eigenvectors (PIC), or initial hierarchy (HC). With those "seeded" values, you can now characterize each of the candidate clusters with 2-3 parameters. Classifying the remaining 10^8 - 5*10^4 points against 3 parameters is a lot faster, being O(size) time instead of O(size^2).
Does that get you moving toward something useful?
I have the following problem at hand: I have a very long list of words, possibly names, surnames, etc. I need to cluster this word list, such that similar words, for example words with similar edit (Levenshtein) distance appears in the same cluster. For example "algorithm" and "alogrithm" should have high chances to appear in the same cluster.
I am well aware of the classical unsupervised clustering methods like k-means clustering, EM clustering in the Pattern Recognition literature. The problem here is that these methods work on points which reside in a vector space. I have words of strings at my hand here. It seems that, the question of how to represent strings in a numerical vector space and to calculate "means" of string clusters is not sufficiently answered, according to my survey efforts until now. A naive approach to attack this problem would be to combine k-Means clustering with Levenshtein distance, but the question still remains "How to represent "means" of strings?". There is a weight called as TF-IDF weigt, but it seems that it is mostly related to the area of "text document" clustering, not for the clustering of single words. It seems that there are some special string clustering algorithms existing, like the one at http://pike.psu.edu/cleandb06/papers/CameraReady_120.pdf
My search in this area is going on still, but I wanted to get ideas from here as well. What would you recommend in this case, is anyone aware of any methods for this kind of problem?
Don't look for clustering. This is misleading. Most algorithms will (more or less forcefully) break your data into a predefined number of groups, no matter what. That k-means isn't the right type of algorithm for your problem should be rather obvious, isn't it?
This sounds very similar; the difference is the scale. A clustering algorithm will produce "macro" clusters, e.g. divide your data set into 10 clusters. What you probably want is that much of your data isn't clustered at all, but you want to want to merge near-duplicate strings, which may stem from errors, right?
Levenshtein distance with a threshold is probably what you need. You can try to accelerate this by using hashing techniques, for example.
Similarly, TF-IDF is the wrong tool. It's used for clustering texts, not strings. TF-IDF is the weight assigned to a single word (string; but it is assumed that this string does not contain spelling errors!) within a larger document. It doesn't work well on short documents, and it won't work at all on single-word strings.
I have encountered the same kind of problem. My approach was to create a graph where each string will be a node and each edge will connect two nodes with weight the similarity of those two strings. You can use edit distance or Sorensen for that. I also set a threshold of 0.2 so that my graph will not be complete thus very computationally heavy. After forming the graph you can use community detection algorithms to detect node communities. Each community is formed with nodes that have a lot of edges with each other, so they will be very similar with each other. You can use networkx or igraph to form the graph and identify each community. So each community will be a cluster of strings. I tested this approach with some string that I wanted to cluster. Here are some of the identified clusters.
University cluster
Council cluster
Committee cluster
I visualised the graph with the gephi tool.
Hope that helps even if it is quite late.