SUMMA (efficient distributed matrix multiplication) on Spark? - apache-spark

I'm trying to figure out if something like http://www.cs.utexas.edu/ftp/techreports/tr95-13.pdf is possible on Spark.
Is it possible to access low-level RDD functionality/distribution in the same kind of way as with MPI? (The key concept in SUMMA is a 2D process topology with row/column broadcasts.)
I've seen simple matrix multiplication in Spark, but it doesn't seem to come close to SUMMA's efficiency.
Thanks!

My schoolmates and I have built a distributed matrix library on top of Spark: Marlin (https://github.com/PasaLab/marlin). The matrix multiplication algorithm implemented in our library follows CARMA (http://www.eecs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf).
At first we studied the SUMMA algorithm, but sending submatrices along the processor rows and columns in each iteration is quite difficult to implement with Spark's API. Recently we implemented a mechanism similar to MPI send and receive in Spark using TorrentBroadcast, which requires modifying the Spark core code. I think that with this strategy it's possible to implement SUMMA in Spark, but fault tolerance and scalability may be a problem.
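For context, here is a minimal sketch of a plain block-partitioned multiply done with a join on block indices (not SUMMA, and the RDD layout is an assumption of mine); it illustrates the shuffle-based pattern Spark's public API pushes you toward instead of row/column broadcasts:

    # Naive block matrix multiply on pair RDDs keyed by block coordinates.
    # Assumes A and B are RDDs of ((block_row, block_col), numpy 2-D array)
    # on a square num_blocks x num_blocks grid. This is a shuffle-heavy
    # stand-in for SUMMA's row/column broadcasts, not an equivalent of them.
    import numpy as np

    def block_multiply(A, B, num_blocks):
        # Replicate each A block across the output block columns and each
        # B block across the output block rows, then join on (i, j, k).
        a = A.flatMap(lambda kv: [((kv[0][0], j, kv[0][1]), kv[1])
                                  for j in range(num_blocks)])
        b = B.flatMap(lambda kv: [((i, kv[0][1], kv[0][0]), kv[1])
                                  for i in range(num_blocks)])
        return (a.join(b)                                   # ((i, j, k), (A_ik, B_kj))
                 .mapValues(lambda ab: ab[0].dot(ab[1]))    # partial products
                 .map(lambda kv: ((kv[0][0], kv[0][1]), kv[1]))
                 .reduceByKey(lambda x, y: x + y))          # C_ij = sum_k A_ik B_kj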

Related

How to quantile-discretize on Spark?

I want to quantile-discretize an RDD[Float] into 10 pieces without Spark ML, so I need to calculate the 10th percentile, 20th percentile, ..., 80th percentile, and 90th percentile.
The dataset is very big and can't be collected to the driver.
Is there an efficient algorithm for this problem?
This capability is already provided if you are using Spark version >= 2.0. You have to convert your RDD[Float] to a DataFrame and use approxQuantile(String col, double[] probabilities, double relativeError) from DataFrameStatFunctions.
The documentation says:
This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in Space-efficient Online Computation of Quantile Summaries by Greenwald and Khanna.
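A minimal PySpark sketch of that suggestion (the column name and error tolerance are illustrative; it assumes an active SparkSession named spark and an RDD of floats named rdd):

    # Approximate deciles via DataFrame.approxQuantile (available since Spark 2.0).
    from pyspark.sql import Row

    df = spark.createDataFrame(rdd.map(lambda x: Row(value=float(x))))
    probabilities = [i / 10.0 for i in range(1, 10)]            # 10th ... 90th percentile
    deciles = df.approxQuantile("value", probabilities, 0.01)   # relativeError = 0.01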

Clustering of Facebook users with k-means

I got a list of Facebook user IDs from the following page:
Stanford Facebook-Data
If you look at the facebook_combined data, you can see that it is a list of user connections (edges). So, for instance, user 0 is connected to users 1, 2, 3, and so on.
Now my task is to find clusters in the dataset.
In the first step I used node.js to read the file and save the data in an array like this:
array=[[0,1],[0,2], ...]
In the second step I used a k-means plugin for node.js to cluster the data:
Cluster-Plugin
But I don't know whether the result is right, because now I get clusters of edges and not clusters of users.
UPDATE:
I am trying out a Markov clustering implementation for node.js. The Markov plugin, however, needs an adjacency matrix to build clusters, so I implemented an algorithm in Java to save the matrix to a file.
Maybe you have another suggestion for how I could get clusters out of edges.
K-means assumes your input data lives in an R^d vector space.
In fact, it requires the data to be this way, because it computes means as cluster centers, hence the name k-means.
So if you want to use k-means, you need:
one row per data point (not an edge list), and
a fixed-dimensionality data space where the mean is a useful center (usually you should have continuous attributes; on binary data the mean does not make much sense) and where least squares is a meaningful optimization criterion (again, on binary data, least squares does not have strong theoretical support).
On your Facebook data you could try some kind of embedding, but I'd have doubts about how trustworthy the result would be.
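If you do want to try it anyway, here is a hedged sketch of the "one row per user" shape with scikit-learn on a toy edge list (the variable names and cluster count are made up, and as noted above the mean of binary adjacency rows is of limited value):

    # Turn an edge list into one adjacency row per user, then run k-means on rows.
    import numpy as np
    from sklearn.cluster import KMeans

    edges = [[0, 1], [0, 2], [1, 2], [3, 4]]              # toy stand-in for facebook_combined
    n = max(max(u, v) for u, v in edges) + 1
    adj = np.zeros((n, n))
    for u, v in edges:
        adj[u, v] = adj[v, u] = 1                         # one row per user, not per edge

    labels = KMeans(n_clusters=2, n_init=10).fit_predict(adj)  # cluster label per user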

PCA in Spark MLlib and Spark ML

Spark now has two machine learning libraries, Spark MLlib and Spark ML. They overlap somewhat in what is implemented, but as I understand it (as someone new to the whole Spark ecosystem) Spark ML is the way to go and MLlib is still around mostly for backward compatibility.
My question is very concrete and related to PCA. The MLlib implementation seems to have a limitation on the number of columns:
spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors.
Also, if you look at the Java code example, there is this:
The number of columns should be small, e.g., less than 1000.
On the other hand, the ML documentation mentions no such limitation.
So my question is: does this limitation also exist in Spark ML? If so, why, and is there any workaround for using this implementation when the number of columns is large?
PCA consists of finding a set of decorrelated random variables with which you can represent your data, sorted in decreasing order with respect to the amount of variance they retain.
These variables can be found by projecting your data points onto a specific orthogonal subspace. If your (mean-centered) data matrix is X, this subspace is spanned by the eigenvectors of X^T X.
When X is large, say of dimensions n x d, you can compute X^T X by taking the outer product of each row with itself and adding all the results up. This is amenable to a simple map-reduce procedure if d is small, no matter how large n is, because the outer product of each row with itself is a d x d matrix that each worker has to manipulate in main memory. That's why you might run into trouble when handling many columns.
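A sketch of that row-wise accumulation with plain RDD operations (the RDD name is an assumption of mine; it only works while a d x d array fits in each worker's memory):

    # X^T X as the sum of per-row outer products x x^T: a map followed by a
    # reduce, so n can be huge as long as d stays small.
    import numpy as np

    def gram_matrix(rows):                    # rows: RDD of length-d NumPy vectors
        return rows.map(lambda x: np.outer(x, x)).reduce(lambda a, b: a + b)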
If the number of columns is large (and the number of rows not so much), you can still compute a PCA: take the SVD of your (mean-centered) transposed data matrix and multiply it by the resulting eigenvectors and the inverse of the diagonal matrix of eigenvalues. There's your orthogonal subspace.
Bottom line: if the spark.ml implementation follows the first approach every time, then the limitation should be the same. If it checks the dimensions of the input dataset to decide whether to switch to the second approach, then you won't have problems with a large number of columns as long as the number of rows is small.
Regardless of that, the limit is imposed by how much memory your workers have, so perhaps they let users hit the ceiling themselves rather than stating a limitation that may not apply to everyone. That might be why they decided not to mention it in the new docs.
Update: the source code reveals that they take the first approach every time, regardless of the dimensionality of the input. The actual limit is 65,535 columns, and a warning is issued at 10,000.
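For completeness, a minimal spark.ml usage sketch (toy data; the column names are illustrative):

    # PCA via spark.ml on a DataFrame of vectors.
    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(Vectors.dense([1.0, 0.0, 3.0]),),
         (Vectors.dense([2.0, 1.0, 0.0]),)],
        ["features"])
    pca = PCA(k=2, inputCol="features", outputCol="pca_features")
    model = pca.fit(df)                       # this is where the column limit would bite
    projected = model.transform(df)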

How to run parallel clustering using Amazon EMR / Spark from files in S3

I have 200,000 points in a 1000-dimensional space.
If I load all these points using sc.textFile and exhaustively calculate the distance between each pair of points, how can I do it in a parallel manner? Will Spark automatically parallelize the work for me?
Yes, Spark automatically parallelizes the work if you use it properly. Here's the Spark introduction guide to get started.
For your use case, are you really looking to calculate the distance between all pairs of points? Computing 40 billion numbers will be quite expensive. If you really want to do this, you'd likely use the cartesian function, which returns an RDD of all pairs of the input data (roughly 40 billion of them), and then calculate the distance for each pair with a map function.
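A hedged sketch of that cartesian-plus-map shape in PySpark (variable names are made up; with 200,000 points this still materializes on the order of 40 billion pairs, so it mostly shows the pattern):

    # All-pairs Euclidean distances via cartesian + map.
    # Assumes `points` is an RDD of (id, numpy vector) pairs.
    import numpy as np

    pairs = points.cartesian(points).filter(lambda ab: ab[0][0] < ab[1][0])  # drop self/duplicate pairs
    distances = pairs.map(
        lambda ab: ((ab[0][0], ab[1][0]),
                    float(np.linalg.norm(ab[0][1] - ab[1][1]))))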

Clustering a huge data matrix in Python?

I want to cluster 1.5 million chemical compounds. That means having a 1.5 million x 1.5 million distance matrix...
I think I can generate such a big table using PyTables, but now, having such a table, how would I cluster it?
I guess I can't just pass a PyTables object to one of the scikit-learn clustering methods...
Are there any Python-based frameworks that would take my huge table and do something useful (like clustering) with it? Perhaps in a distributed manner?
Maybe you should look at algorithms that don't need a full distance matrix.
I know it is popular to formulate algorithms as matrix operations, because tools such as R are rather fast at matrix operations (and slow at other things), but there is a whole ton of methods that don't require O(n^2) memory.
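For instance, here is a hedged sketch with scikit-learn's MiniBatchKMeans, which clusters feature vectors directly and never materializes a distance matrix (the fingerprint array and parameters are stand-ins):

    # Cluster compound feature vectors without an n x n distance matrix.
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    fingerprints = np.random.rand(10_000, 128)    # stand-in for 1.5M compound descriptors
    model = MiniBatchKMeans(n_clusters=100, batch_size=10_000, n_init=3)
    labels = model.fit_predict(fingerprints)      # data can also be streamed via partial_fit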
I think the main problem is memory: 1.5 million x 1.5 million x 10 bytes (per element) > 20 TB.
You can use a big-data store like PyTables, or Hadoop (http://en.wikipedia.org/wiki/Apache_Hadoop) with the MapReduce algorithm.
Here are some guides: http://strata.oreilly.com/2013/03/python-data-tools-just-keep-getting-better.html
Or use the Google App Engine Datastore with MapReduce (https://developers.google.com/appengine/docs/python/dataprocessing/), though it isn't a production version yet.