Is it possible to create a distributed BlockMatrix containing single-precision entries in Spark?
From what I gather from the documentation, the Scala/Java implementation of BlockMatrix requires an mllib.Matrix object, which holds its values as doubles.
Is there any way around this limitation?
Background:
I'm using GPUs to accelerate Spark's distributed matrix multiplication routines, and my GPU performs about 20 times slower when multiplying double-precision matrices than single-precision ones.
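For reference, a quick check in PySpark (I'm assuming the Python API mirrors the JVM-side storage) shows the matrix values come back as doubles:

from pyspark.mllib.linalg import Matrices

# build a small local dense matrix and inspect the dtype of its values
m = Matrices.dense(2, 2, [1.0, 2.0, 3.0, 4.0])
print(m.toArray().dtype)   # float64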
I have implemented a classification algorithm in Spark that involves calculating distances between instances. The implementation uses DataFrames (and raw SQL where possible). I transform the features of the instances into a vector so I can apply a Scaler and end up with a uniform schema regardless of how many features my dataset happens to have.
As far as I understand, Spark SQL can't do calculations on vector columns. So in order to calculate the distance between instances, I've had to define a Python function and register it as a UDF. But I see warnings against using UDFs because the DataFrame engine "can't optimise UDFs".
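For concreteness, a stripped-down sketch of the kind of UDF I mean (the column names and the pairs_df DataFrame are just placeholders):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def euclidean(v1, v2):
    # v1 and v2 arrive as pyspark.ml.linalg.Vector objects
    diff = v1.toArray() - v2.toArray()
    return float((diff * diff).sum() ** 0.5)

euclidean_udf = udf(euclidean, DoubleType())

# pairs_df is assumed to hold two vector columns, one per instance
pairs_df = pairs_df.withColumn("distance", euclidean_udf("features_a", "features_b"))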
My questions are:
Is it correct that there is no way to calculate the distance between two feature vectors within SQL (not using a UDF)?
Can the use of a UDF to calculate the distance between vectors have a large impact on performance, or is there nothing for Spark to optimise here anyway?
Is there some other consideration I've missed?
To be clear, I'm hoping the answer is either
"You're doing it wrong, this is indeed inefficient, here's how to do it instead: ...", or
"UDFs are not intrinsically inefficient, this is a perfectly good use for them and there's no opimisation you're missing out on"
UDFs are neither efficient nor optimised, and they are not executed as JVM code. With PySpark in particular, rows have to be pickled and shipped back and forth between the JVM and the Python workers, and the OS spends a lot of resources on that transfer. I implemented a geolocation job in PySpark using a UDF and it would not finish within a few days; the same thing implemented in Scala finished in a few hours.
Do it in scala if you have to do it.
Maybe that can help
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
I have to compute the smallest-magnitude eigenvalue and its associated eigenvector of a non-symmetric matrix using PySpark libraries.
The matrix is very large and I want the computation to be distributed among the cluster's workers.
The problem is that I didn't find any API to compute eigenvalues in the PySpark 2.3 documentation.
I have identified two paths, but I want to avoid them:
to reimplement eigenvalue decomposition through the QR algorithm using the QRDecomposition available in the PySpark API (see the sketch below)
to compute eigenvalue decomposition through the Scala version of the class, as described in this question on Stack Overflow
Is there a simpler or better way than these last two?
I already know about this post, but the two questions are conceptually different.
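For reference, the building block for the first path is already exposed in PySpark; rows below is a placeholder RDD of mllib vectors holding the matrix rows:

from pyspark.mllib.linalg.distributed import RowMatrix

A = RowMatrix(rows)
# tallSkinnyQR targets tall-and-skinny matrices, so this only works when
# the number of columns is small enough for R to fit on the driver
qr = A.tallSkinnyQR(computeQ=True)   # available since Spark 2.2
Q = qr.Q   # distributed RowMatrix
R = qr.R   # local DenseMatrix
# the QR algorithm would then iterate A_{k+1} = R_k * Q_k until the
# diagonal converges to the eigenvalues; that loop is the part I would
# have to write myself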
I have a huge list of names and surnames and I am trying to merge duplicates, for example 'Michael Jordan' with 'Jordan Michael'.
I am doing the following procedure using pyspark:
calculate tf-idf -> compute cosine similarity -> convert to a sparse matrix
calculate the string-distance matrix -> convert to a dense matrix
element-wise multiplication between the tf-idf sparse matrix and the string-distance dense matrix to calculate the 'final similarity'
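Roughly, the tf-idf part of the first step looks like this (names is a placeholder RDD of the raw strings):

from pyspark.mllib.feature import HashingTF, IDF

tokens = names.map(lambda s: s.lower().split())
tf = HashingTF().transform(tokens)
tf.cache()
tfidf = IDF().fit(tf).transform(tf)
# for the cosine-similarity step, RowMatrix.columnSimilarities() compares
# columns, so the tf-idf matrix needs one name per column at that point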
This works fine for 10,000 names, but I have doubts about how long it will take for a million names, since each matrix would be 1,000,000 x 1,000,000. (As the matrices are symmetric I only take the upper triangle, but that doesn't change the high time complexity very much.)
I have read that after computing the tf-idf it is really useful to compute the SVD of the output matrices to reduce the dimensions. I couldn't find an example of computeSVD for PySpark in the documentation. Doesn't it exist?
And how can SVD help in my case to reduce the high memory use and computational time?
Any feedback and ideas are welcome.
Just to update this, computeSVD is now available in the PySpark mllib API for RowMatrix and IndexedRowMatrix.
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRowMatrix
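A minimal usage sketch (k=20 and the rows RDD are placeholders):

from pyspark.mllib.linalg.distributed import RowMatrix

# rows is assumed to be an RDD of mllib vectors (e.g. the tf-idf vectors)
mat = RowMatrix(rows)
svd = mat.computeSVD(20, computeU=True)   # keep the top 20 singular values
U = svd.U   # distributed RowMatrix
s = svd.s   # DenseVector of singular values
V = svd.V   # local DenseMatrix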
I couldn't find an example of computeSVD for PySpark. Doesn't it exist?
No, it doesn't. As of now (Spark 1.6.0 / Spark 2.0.0 SNAPSHOT), computeSVD is available only in the Scala API. You can use the solution provided by eliasah here:
Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?
And how can SVD help in my case to reduce the high memory use and computational time?
It depends. If your data is simply a set of very short (2-3 word) strings and you tokenize it by splitting on whitespace, it won't help you at all. It cannot improve the brute-force approach you use, and your data is already extremely sparse.
If you process your data in some context or extract more complex features (n-grams, for example), it can reduce the cost, but it still won't help you with the overall complexity.
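For instance, a character n-gram tokenizer (plain Python; names is a placeholder RDD of the raw strings) would feed richer features into the same HashingTF/IDF pipeline:

# character n-grams give short name strings more context than whole words
def char_ngrams(s, n=3):
    s = s.lower().replace(" ", "_")
    return [s[i:i + n] for i in range(max(len(s) - n + 1, 1))]

tokens = names.map(char_ngrams)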
Apache Spark supports sparse data.
For example, we can use MLUtils.loadLibSVMFile(...) to load data into an RDD.
I was wondering how Spark deals with the missing values.
Spark creates an RDD of LabeledPoints, and each labeled point has a label and a vector of features. Note that this is a Spark Vector, which does support sparse data (currently a SparseVector is represented by an array of the indices of the non-zero entries and a second array of doubles holding the corresponding values).
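A quick way to see this (sc is the SparkContext and the file path is a placeholder):

from pyspark.mllib.util import MLUtils
from pyspark.mllib.linalg import SparseVector

# LibSVM files only list the non-zero entries, and that is exactly what
# ends up in each LabeledPoint's feature vector
data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
print(data.first().features)   # a SparseVector

# the same structure built by hand: size, indices of non-zeros, their values
sv = SparseVector(5, [1, 3], [2.0, 4.0])   # dense view: [0.0, 2.0, 0.0, 4.0, 0.0]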
I am trying to cluster 250k vectors of 1000 dimensions using k-means. The machine that I am working on has 80 dual-cores.
Just confirming: has anyone compared the run-time of the default batch parallel version of k-means against the mini-batch version? The example comparison page in the sklearn documentation doesn't provide much information, as the dataset used there is quite small.
I'd much appreciate your help.
Conventional wisdom holds that Mini-Batch K-Means should be faster and more efficient for more than 10,000 samples. Since you have 250,000 samples, you should probably use mini-batch if you don't want to test it out on your own.
Note that the example you referenced can very easily be changed to a 5,000, 10,000 or 20,000 point example by changing n_samples in this line:
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)
I agree that this won't necessarily scale the same way for 1000-dimensional vectors, but since you are constructing the example yourself, are using either k-means or mini-batch k-means, and it only takes a second to switch between them, you should just do a scaling study with your 1000-dimensional vectors for 5k, 10k, 15k and 20k samples.
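Something along these lines (50 clusters and the blob parameters are placeholders, since you didn't say how many clusters you need):

from time import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# crude scaling study: same data, both algorithms, growing sample sizes
for n in (5000, 10000, 15000, 20000):
    X, _ = make_blobs(n_samples=n, n_features=1000, centers=50, cluster_std=0.7)
    for algo in (KMeans(n_clusters=50), MiniBatchKMeans(n_clusters=50)):
        t0 = time()
        algo.fit(X)
        print(type(algo).__name__, n, round(time() - t0, 2))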
Theoretically, there is no reason why Mini-Batch K-Means should underperform K-Means because of vector dimensionality, and we know that it does better for larger sample sizes, so I would go with mini-batch off the cuff, i.e. bias for action over research.