Is there any function or method that calculates a dissimilarity matrix for a given data set? I've found all-pairs similarity via DIMSUM, but it looks like it works for sparse data only. Mine is really dense.
Even though the original DIMSUM paper is talking about a matrix which:
each dimension is sparse with at most L nonzeros per row
And which values are:
the entries of A have been scaled to be in [−1, 1]
This is not a requirement, and you can run it on a dense matrix. If you check the sample code by the DIMSUM author from the Databricks blog, you'll notice that the RowMatrix is in fact created from an RDD of dense vectors:
// Load and parse the data file.
val rows = sc.textFile(filename).map { line =>
val values = line.split(' ').map(_.toDouble)
Vectors.dense(values)
}
val mat = new RowMatrix(rows)
Similarly, the comment in the CosineSimilarity Spark example gives as input a dense matrix that is not scaled.
You need to be aware that the only available method is columnSimilarities(), which calculates similarities between columns. Hence, if your input data file is structured as one record per row, you will have to transpose the matrix first and then run the similarity computation. To answer your question: no, there is no transpose on RowMatrix; other distributed matrix types in MLlib do have that feature, so you would have to do some conversions first.
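If it helps, here is a minimal PySpark sketch of that workaround, going through a CoordinateMatrix to transpose before computing the similarities (assuming a Spark version where CoordinateMatrix.transpose() is available); the file name is a placeholder and the data layout is assumed to match the Scala snippet above.
from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

sc = SparkContext.getOrCreate()
# One record per line, space-separated features (same layout as the Scala snippet).
rows = sc.textFile("data.txt").map(
    lambda line: Vectors.dense([float(x) for x in line.split(' ')]))
# Transpose by going through a CoordinateMatrix, then compute similarities
# between the original rows (which are now columns).
entries = rows.zipWithIndex().flatMap(
    lambda vi: [MatrixEntry(vi[1], j, x) for j, x in enumerate(vi[0].toArray())])
sims = CoordinateMatrix(entries).transpose().toRowMatrix().columnSimilarities()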
Row similarity is in the works but unfortunately did not make it into the newest release, Spark 1.5.
As for other options, you would have to implement them yourself. The naive brute-force solution, which requires O(mL^2) shuffles, is very easy to implement (cartesian + your similarity measure of choice) but performs very badly (speaking from experience).
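For reference, a minimal sketch of that brute-force variant in PySpark, with cosine similarity picked arbitrarily and reusing the rows RDD from the PySpark sketch above; only the upper triangle is kept.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# (index, feature array) pairs; the filter keeps each unordered pair once.
indexed = rows.zipWithIndex().map(lambda vi: (vi[1], vi[0].toArray()))
pairs = indexed.cartesian(indexed).filter(lambda ab: ab[0][0] < ab[1][0])
sims = pairs.map(lambda ab: ((ab[0][0], ab[1][0]), cosine(ab[0][1], ab[1][1])))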
You can also have a look at a different algorithm by the same author called DISCO, but it is not implemented in Spark (and the paper also assumes L-sparsity).
Finally, be advised that both DIMSUM and DISCO compute estimates (although extremely good ones).
I want to understand the behavior of DF.intersect().
The question came to mind especially when we have complex Rows with complex, deeply nested fields.
If we are talking about the DataFrame intersect transformation then, according to the Dataset documentation and source, the comparison is done directly on the encoded content, which is as deep as it can possibly go.
def intersect(other: Dataset[T]): Dataset[T]
Returns a new Dataset containing rows only in both this Dataset and another Dataset. This is equivalent to INTERSECT in SQL.
Since: 1.6.0
Note: Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
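A tiny PySpark illustration of that point (the data here is made up): because equality is checked on the encoded rows, nested fields are compared all the way down.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
a = spark.createDataFrame([Row(id=1, geo=Row(x=1.0, y=2.0)),
                           Row(id=2, geo=Row(x=3.0, y=4.0))])
b = spark.createDataFrame([Row(id=1, geo=Row(x=1.0, y=2.0)),
                           Row(id=3, geo=Row(x=5.0, y=6.0))])
a.intersect(b).show()  # only the row whose full nested content matches (id=1) survives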
I have a number of datasets where I have an array of x, y, z coordinates of the endpoints of segments. The first and second points represent a segment, as do the third and fourth, and so on...
The above data represents just a part of the dataset... The entire dataset is a lot bigger.
I am required to train my model with several datasets like this, so that it can later predict the category of any unknown dataset. The test datasets will have the same form as the above.
I need help with the approach. Which algorithm or approach can I use here to classify any unknown dataset into these known categories?
It's an unsupervised learning problem. If you know roughly how many classes your data should be split into, use K-Means (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).
Otherwise, a combination of t-SNE (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) and K-Means usually works well: transform the data using t-SNE and run K-Means on the transformed data.
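A minimal scikit-learn sketch of that second approach; the file name, cluster count and t-SNE settings are placeholders you would have to tune.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

X = np.loadtxt("segments.txt")  # hypothetical file: one flattened sample per line
embedded = TSNE(n_components=2, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=5, random_state=0).fit_predict(embedded)  # 5 is just a guess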
I am struggling with the implementation of a performant version of a SOM Batch algorithm on Spark / Pyspark for a huge dataset with > 100 features.
I have the feeling that I can either use RDDs, where I can/have to specify the parallelization on my own, or use DataFrames, which should be more performant, but I see no way to use something like a local accumulation variable for each worker when using DataFrames.
Ideas:
Using accumulators. Parallelize the calculations by creating a UDF which takes the observations as input, calculates the impacts on the net and sends the impacts to an accumulator on the driver. (I have implemented this version already, but it seems rather slow; I think the accumulator updates take too long.)
Store the results in a new column of the DataFrame and then sum them together at the end. (I would have to store a whole neural net in each row, e.g. 20*20*130, though.) Would Spark's optimizer realize that it does not need to keep each net separately but only to sum them together?
Create a custom parallelized algorithm using RDDs, similar to this: https://machinelearningnepal.com/2018/01/22/apache-spark-implementation-of-som-batch-algorithm/ (but with more performant calculations). However, I would have to use some kind of loop over each row to update the net, which sounds rather unperformant.
Any thoughts on the different options? Is there an even better option?
Or are none of these ideas that good, and should I just preselect a maximum-variety subset of my dataset and train a SOM locally on that?
Thanks!
This is exactly what I did last year, so I might be in a good position to give you an answer.
First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in Pyspark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:
they did not really implement the batch SOM algorithm, but used a map averaging method that gave me strange results (abnormal symmetries in the output map)
they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit()/transform() API operating over DataFrames.
So I went on to code it myself: the batch SOM algorithm in Spark ML style. The first thing I did was to look at how k-means is implemented in Spark ML because, as you know, the batch SOM is very similar to the k-means algorithm. Actually, I could reuse a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
I can summarize quickly how the model is built:
A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
A SOM class, which inherits from Spark's Estimator and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame where the features are stored as a spark.ml.linalg.Vector in a single column. fit() will then select this column and unpack the DataFrame to obtain the underlying RDD[Vector] of features, and call the run() method on it. This is where all the computations happen, and as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
SOMModel is a trained SOM model, and inherits from Spark's Transformer/Model. It contains the map prototypes (center vectors) and a transform() method that can operate on DataFrames by taking an input feature column and adding a new column with the predictions (projection on the map). This is done by a prediction UDF (a minimal sketch of that idea is shown after this list).
There is also SOMTrainingSummary that collects stuff such as the objective function.
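To illustrate the prediction UDF mentioned above, here is a minimal PySpark sketch (not the author's actual code): it broadcasts the prototype vectors and maps each feature vector to the index of its best matching unit. All names are hypothetical.
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def make_bmu_udf(sc, prototypes):
    # prototypes: (nNodes, nFeatures) numpy array of map prototypes (center vectors)
    bc = sc.broadcast(prototypes)
    def bmu(features):
        x = features.toArray()
        return int(np.argmin(((bc.value - x) ** 2).sum(axis=1)))  # closest prototype
    return udf(bmu, IntegerType())

# usage: df.withColumn("prediction", make_bmu_udf(sc, protos)(df["features"]))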
Here are the take-aways:
There is not really an opposition between RDDs and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as an RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and optimization of the execution plan (Catalyst optimizer).
For structured data and select/filter/aggregation operations, DO USE DataFrames, always.
...but for more complex tasks such as a machine learning algorithm, you NEED to come back to the RDD API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey and so on. Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations! A sketch of that pattern follows below.
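As a concrete (and heavily simplified) example of that pattern, here is a PySpark sketch of one batch update step: select the feature column, drop down to the RDD, broadcast the current prototypes and reduce the per-record contributions. The neighbourhood function is omitted, so this is only the skeleton, not a full batch SOM, and all names are hypothetical.
import numpy as np

def one_batch_step(df, feature_col, prototypes, sc):
    # df: DataFrame with an ml Vector column; prototypes: (nNodes, nFeatures) numpy array
    bc = sc.broadcast(prototypes)

    def contrib(x):
        bmu = int(np.argmin(((bc.value - x) ** 2).sum(axis=1)))  # best matching unit
        num = np.zeros_like(bc.value)      # per-node numerator
        den = np.zeros(bc.value.shape[0])  # per-node denominator
        num[bmu], den[bmu] = x, 1.0
        return num, den

    rdd = df.select(feature_col).rdd.map(lambda row: row[0].toArray())
    num, den = rdd.map(contrib).reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    # Nodes with no assigned data are left at zero here; a real implementation would
    # keep the old prototype and apply the neighbourhood kernel.
    return num / np.maximum(den, 1.0)[:, None]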
Hope this answers your question. Concerning performance: as you asked for an efficient implementation, I have not run any benchmarks yet, but I use it at work and it crunches 500k/1M-row datasets in a couple of minutes on the production cluster.
I have a huge list of names and surnames and I am trying to merge duplicates, for example 'Michael Jordan' with 'Jordan Michael'.
I am doing the following procedure using pyspark:
Calculate tfidf -> compute cos similarity -> convert to sparse matrix
calculate string distance matrix -> convert to dense matrix
element-wise multiplication between tfidf sparse matrix and string distance dense matrix to calculate the 'final similarity'
This works OK for 10,000 names, but I have doubts about how long it will take to calculate the similarities for a million names, since each matrix would be 1,000,000 x 1,000,000. (As the matrices are symmetric I only take the upper triangle, but that does not change the high time complexity very much.)
I have read that after computing the tf-idf it is really useful to compute the SVD of the output matrices to reduce the dimensions. I couldn't find an example of computeSVD for PySpark in the documentation. Does it not exist?
And how can SVD help in my case to reduce the high memory usage and computation time?
Any feedback and ideas are welcome.
Just to update this: computeSVD is now available in the PySpark MLlib API for RowMatrix and IndexedRowMatrix.
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRowMatrix
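For example, a minimal PySpark sketch (the matrix content is just a placeholder, and an active SparkContext sc is assumed):
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

rows = sc.parallelize([Vectors.dense([1.0, 2.0, 3.0]),
                       Vectors.dense([4.0, 5.0, 6.0]),
                       Vectors.dense([7.0, 8.0, 9.0])])
svd = RowMatrix(rows).computeSVD(k=2, computeU=True)
U, s, V = svd.U, svd.s, svd.V  # U: distributed RowMatrix, s: singular values, V: local matrix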
I couldn't find an example of computeSVD for PySpark. Does it not exist?
No, it doesn't. As of now (Spark 1.6.0 / Spark 2.0.0 SNAPSHOT) computeSVD is available only in the Scala API. You can use the solution provided by eliasah here:
Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?
And how can SVD help in my case to reduce the high memory usage and computation time?
It depends. If your data is simply a set of very short (2-3 word) strings and you tokenize your data by simply splitting on whitespace, it won't help you at all. It cannot improve the brute-force approach you use, and your data is already extremely sparse.
If you process your data in some context or extract more complex features (n-grams for example), it can reduce the cost, but it still won't help you with the overall complexity.
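In case it helps, here is a minimal pyspark.ml sketch of the kind of richer features meant above, character-trigram tf-idf; the DataFrame and column names are hypothetical.
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, NGram, RegexTokenizer

pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="name", outputCol="chars", pattern=".", gaps=False),  # one token per character
    NGram(n=3, inputCol="chars", outputCol="ngrams"),
    HashingTF(inputCol="ngrams", outputCol="tf", numFeatures=1 << 18),
    IDF(inputCol="tf", outputCol="features"),
])
features = pipeline.fit(df).transform(df)  # df: DataFrame with a 'name' string column (assumed)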
I'm trying to find similar documents based on their text in spark. I'm using python with Spark.
So far I have implemented RowMatrix, IndexedRowMatrix, and CoordinateMatrix to set this up, and then I implemented columnSimilarities (DIMSUM). The problem with DIMSUM is that it is optimized for many features and few items. http://stanford.edu/~rezab/papers/dimsum.pdf
Our initial approach was to create tf-idf vectors of all words in all documents, then transpose that into a RowMatrix where we have a row for each word and a column for each item. Then we ran columnSimilarities, which gives us a CoordinateMatrix of ((item_i, item_j), similarity). This just doesn't work well when the number of columns > the number of rows.
We need a way to calculate all-pairs similarity with many items and few features: #items = 10^7, #features = 10^4. At a higher level, we're trying to create an item-based recommender that, given one item, will return a few quality recommendations based only on the text.
I'd write this as a comment instead of an answer, but SO won't let me comment yet.
This would be "trivially" solved by utilizing ElasticSearch's more-like-this query. From docs you can see how it works and which factors are taken into account, which should be useful info even if you end up implementing this in Python.
They have also implemented other interesting algorithms such as the significant terms aggregation.