I am having a huge list of names-surnames and I am trying to merge them. For example 'Michael Jordan' with Jordan Michael.
I am doing the following procedure using pyspark:
Calculate tfidf -> compute cos similarity -> convert to sparse matrix
calculate string distance matrix -> convert to dense matrix
element-wise multiplication between tfidf sparse matrix and string distance dense matrix to calculate the 'final similarity'
This works ok for 10000 names but I doubt about how long it will take to calculate a million names similarity as each matrix is 1000000x1000000 (As the matrices are symmetric I am taking only the upper triangle matrix but that doesn't change so much the high complexity time that is needed).
I have read that after computing the tfidf it is really useful to compute the SVD of the output matrices to reduce the dimensions. From the documentation I couldn't find an example of computeSVD for pyspark. It doesn't exist?
And how can SVD can help in my case to reduce the high memory and computational time?
Any feedback and ideas are welcome.
Just to update this, computeSVD is now availabe in the PySpark mllib API for RowMatrix and IndexedRowMatrix.
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRowMatrix
I couldn't find an example of computeSVD for pyspark. It doesn't exist?
No, it doesn't. As for now (Spark 1.6.0 / Spark 2.0.0 SNAPSHOT) computeSVD is available only in Scala API. You can use solution provided by eliasah here:
Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?
And how can SVD can help in my case to reduce the high memory and computational time?
It depends. If your data is simply as set of a very short (2-3 words) strings and you tokenize your data by simply splitting on whitespace it won't help you at all. It cannot improve brute force approach you use and your data is already extremely sparse.
If you process your data in some context or extract more complex features (ngrams for example) it can reduce the cost buts still won't help you with overall complexity.
Related
I am trying to build a random forest on a slightly large data set - half million rows and 20K columns (dense matrix).
I have tried modifying the hyperparameters such as:
n_jobs = -1 or iterating over max depth. However it's either getting stopped because of a memory issue (I have a 320GB server) or the accuracy is very low (when i use a lower max_depth)
Is there a way where I can still use all the features and build the model without any memory issue or not loosing on accuracy?
In my opinion (don't know exactly your case and dataset) you should focus on extract information from your dataset, especially if you have 20k of columns. I assume some of them will not give much variance or will be redundant, so you can make you dataset slightly smaller and more robust to potential overfit.
Also, you should try to use some dimensionality reduction methods which will allows you to make your dataset smaller retaining the most of the variance.
sample code for pca
pca gist
PCA for example (did not mean to offend you if you already know this methods)
pca wiki
I am using k-means on a dataset including more than 150k documents but i don't know what a good k value is.
I have tried elbow method to find it but the inertia value doesn't change so much.(i am using sklearn).
here is the
If elbow method does not have a clear answer, then possibly no number of clusters is particularly good. k-means can only model spherical relationships, which might be limiting. You can maybe try other feature representations, such as something based on Word Embeddings.
For a document grouping task, you might want to use a topic modelling approach instead of clustering, like Latent Dirichlet Allocation (LDA) or Non-negative Matrix factorization (NMF).
Is it possible to create a distributed blockmatrix containing single precision entries in spark?
From what i gather from the documentation, the scala/java implementation of blockmatrix requires a mllib.Matrix object, which holds the values as doubles.
Is there any way around this limitation?
Background:
Im using GPU's to accelerate Sparks distributed matrix multiplication routines, and my GPU performs 20 times slower when multiplying double precision matrices rather than single precision matrices.
Is there any function or method that calculates dissimilarity matrix for a given data set? I've found All-pairs similarity via DIMSUM but it looks like it works for sparse data only. Mine is really dense.
Even though the original DIMSUM paper is talking about a matrix which:
each dimension is sparse with at most L nonzeros per row
And which values are:
the entries of A have been scaled to be in [−1, 1]
This is not a requirement and you can run it on a dense matrix. Actually if you check the sample code by the DIMSUM author from the databricks blog you'll notice that the RowMatrix is in fact created from an RDD of dense vectors:
// Load and parse the data file.
val rows = sc.textFile(filename).map { line =>
val values = line.split(' ').map(_.toDouble)
Vectors.dense(values)
}
val mat = new RowMatrix(rows)
Similarly the comment in CosineSimilarity Spark example gives as input a dense matrix which is not scaled.
You need to be aware that the only available method is the columnSimilarities(), which calculates similarities between columns. Hence if your input data file is structured in a way record = row, then you will have to do a matrix transpose first and then run the similarity. To answer your question, no there is no transpose on RowMatrix, other types of matrices in MLlib do have that feature so you'd have to do some transformations first.
Row similarity is in the works and did not make it into the newest Spark 1.5 unfortunately.
As for other options, you would have to implement them yourself. The naive brute force solution which requires O(mL^2) shuffles is very easy to implement (cartesian + your similiarity measure of choice) but performs very badly (speaking from experience).
You can also have a look at a different algorithm from the same person called DISCO but it's not implemented in Spark (and the paper also assumes L-sparsity).
Finally be advised that both DIMSUM and DISCO are estimates (although extremely good ones).
I am trying to cluster 1000 dimension, 250k vectors using k-means. The machine that I am working on has 80 dual-cores.
Just confirming, if anyone has compared the run-time of k-means default batch parallel version against k-means mini-batch version? The example comparison page on sklean documents doesn't provide much info as the dataset is quite small.
Much appreciate your help.
Regards,
Conventional wisdom holds that Mini-Batch K-Means should be faster and more efficient for greater than 10,000 samples. Since you have 250,000 samples, you should probably use mini-batch if you don't want to test it out on your own.
Note that the example you referenced can very easily be changed to a 5000, 10,000 or 20,000 point example by changing n_samples in this line:
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)
I agree that this won't necessarily scale the same for 1000 dimensional vectors, but since you are constructing the example and are using either k-means or mini batch k-means and it only takes a second to switch between them... You should just do a scaling study for your 1000 dimensional vectors for 5k, 10k, 15k, 20k samples.
Theoretically, there is no reason why Mini-Batch K-Means should underperform K-Means due to vector dimensionality and we know that it does better for larger sample sizes, so I would go with mini batch off the cuff e.g. bias for action over research.