How to quantile-discretize on Spark? - apache-spark

I want to quantile-discretize an RDD[Float] into 10 pieces without Spark.ML, so I need to calculate the 10th, 20th, ..., 80th, and 90th percentiles.
The dataset is very big and can't be collected to the driver!
Is there an efficient algorithm to solve this problem?

This capability is already provided if you are using Spark version 2.0 or later. You have to convert your RDD[Float] to a DataFrame, then use approxQuantile(String col, double[] probabilities, double relativeError) from DataFrameStatFunctions.
The documentation says:
This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in Space-efficient Online Computation of Quantile Summaries by Greenwald and Khanna.
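For illustration, here is a minimal PySpark sketch of that answer (assuming Spark 2.0+; the toy RDD below is just a stand-in for the large RDD[Float] from the question):

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.appName("decile-example").getOrCreate()

    # toy stand-in for the large RDD[Float] from the question
    rdd = spark.sparkContext.parallelize([float(x) for x in range(1000)])

    # convert the RDD to a single-column DataFrame
    df = rdd.map(lambda x: Row(value=x)).toDF()

    # 10th, 20th, ..., 90th percentiles; the last argument is the relative
    # error of the Greenwald-Khanna sketch (0.0 means exact but slower)
    deciles = df.approxQuantile("value", [i / 10.0 for i in range(1, 10)], 0.01)
    print(deciles)  # list of 9 approximate percentile values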

Related

Suggestion for clustering algorithm?

I have a dataset of 590,000 records after preprocessing and I want to find clusters in it. It contains string data (for now, assume I have only one column with 590,000 unique values). I am also using a custom-defined distance measure and need to calculate a distance matrix of size 590,000 x 590,000. Using some partitioning logic I created the distance matrix, but I cannot merge those partitions into one big distance matrix due to memory constraints. Does anyone have any idea how to resolve this? I picked DBSCAN for it. Is there any way to use deep learning methodologies? Any other ideas?
Use a manageable sample first, because I doubt the results will be good enough to warrant any effort on scaling a method that does not work anyway.
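As a rough illustration of the "sample first" advice, here is a hedged Python sketch; records and my_distance are hypothetical stand-ins for the questioner's data and custom distance measure, and eps/min_samples are placeholders to tune:

    import random

    import numpy as np
    from sklearn.cluster import DBSCAN

    # `records` is the full list of ~590k strings; `my_distance` is the
    # custom string distance from the question -- both assumed to exist here.
    sample = random.sample(records, 5000)  # manageable subset

    # precompute the pairwise distance matrix only for the sample
    n = len(sample)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = my_distance(sample[i], sample[j])
            dist[i, j] = dist[j, i] = d

    # DBSCAN accepts a precomputed distance matrix directly
    labels = DBSCAN(eps=0.3, min_samples=5, metric="precomputed").fit_predict(dist)

If the clusters on the sample are not convincing, scaling the full 590k x 590k matrix is unlikely to be worth the effort.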

Find top K cosine similar vectors to a given vector efficiently

The problem:
Suppose I have a group of around 1,000,000 short documents D (no more than 50 words each), and I want to let users supply a document from the same group D and get the top K most similar documents from D.
My approach:
My first approach was to preprocess the group D by applying simple tf-idf and then, once I have a vector for each document (which is extremely sparse), to use a simple nearest-neighbours algorithm based on cosine similarity.
Then, at query time, to just use my static nearest-neighbours table, whose size is 1,000,000 x K, without any further calculations.
After applying tf-idf, I got vectors of dimension ~200,000, which means I now have a very sparse table (that can be stored efficiently in memory using sparse vectors) of size 1,000,000 x 200,000.
However, calculating the nearest-neighbours model has taken me more than one day and still hasn't finished.
I tried to lower the vector dimension by applying HashingTF, which utilizes the hashing trick, so I can set the dimension to a constant (in my case, I used 2^13 features), but I still get the same bad performance.
Some technical information:
I use Spark 2.0 for the tf-idf calculation and scikit-learn's NearestNeighbors on the collected data.
Is there a more efficient way to achieve this goal?
Thanks in advance.
Edit:
I had an idea to try an LSH-based approximate similarity algorithm like those implemented in Spark as described here, but I could not find one that supports the cosine similarity metric.
There are some requirements on the relation between the number of training instances and the dimensionality of your vectors, but you can try DIMSUM.
You can find the paper here.
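For illustration, a minimal PySpark sketch of DIMSUM via RowMatrix.columnSimilarities. Note that it computes cosine similarities between columns, so this sketch assumes the tf-idf matrix has been transposed so that documents are the columns; the toy rows below stand in for the real term-by-document data:

    from pyspark.sql import SparkSession
    from pyspark.mllib.linalg.distributed import RowMatrix

    spark = SparkSession.builder.appName("dimsum-example").getOrCreate()
    sc = spark.sparkContext

    # one row per term, one column per document (transposed tf-idf matrix)
    tfidf_transposed = sc.parallelize([
        [1.0, 0.0, 2.0],
        [0.0, 3.0, 1.0],
        [4.0, 0.0, 0.0],
    ])  # toy stand-in

    mat = RowMatrix(tfidf_transposed)

    # a positive threshold enables DIMSUM sampling; column pairs whose
    # estimated cosine similarity is below it are dropped
    sims = mat.columnSimilarities(threshold=0.1)
    print(sims.entries.take(5))  # MatrixEntry(i, j, cosine_similarity)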

PCA in Spark MLlib and Spark ML

Spark now has two machine learning libraries - Spark MLlib and Spark ML. They somewhat overlap in what is implemented, but as I understand it (as a person new to the whole Spark ecosystem), Spark ML is the way to go and MLlib is still around mostly for backward compatibility.
My question is very concrete and related to PCA. The MLlib implementation seems to have a limitation on the number of columns:
spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors.
Also, if you look at the Java code example, there is this note:
The number of columns should be small, e.g., less than 1000.
On the other hand, if you look at the ML documentation, no limitations are mentioned.
So, my question is: does this limitation also exist in Spark ML? And if so, why is there a limitation, and is there any workaround that would make it possible to use this implementation even when the number of columns is large?
PCA consists of finding a set of decorrelated random variables with which you can represent your data, sorted in decreasing order with respect to the amount of variance they retain.
These variables can be found by projecting your data points onto a specific orthogonal subspace. If your (mean-centered) data matrix is X, this subspace is comprised of the eigenvectors of X^T X.
When X is large, say of dimensions n x d, you can compute X^T X by computing the outer product of each row of the matrix by itself, then adding all the results up. This is of course amenable to a simple map-reduce procedure if d is small, no matter how large n is. That's because the outer product of each row by itself is a d x d matrix, which will have to be manipulated in main memory by each worker. That's why you might run into trouble when handling many columns.
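To illustrate that map-reduce procedure, here is a minimal PySpark sketch (mean-centering is omitted for brevity, and the random rows are just a stand-in for real data):

    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gram-matrix-example").getOrCreate()
    sc = spark.sparkContext

    d = 5  # number of columns -- each worker must hold a d x d matrix in memory
    rows = sc.parallelize([np.random.rand(d) for _ in range(1000)])

    # each row contributes its d x d outer product; summing them gives X^T X
    gram = rows.map(lambda r: np.outer(r, r)).treeReduce(lambda a, b: a + b)

    # for mean-centered data, the eigenvectors of X^T X span the PCA subspace
    eigvals, eigvecs = np.linalg.eigh(gram)

The d x d outer product is what blows up when d (the number of columns) grows, which is exactly where the documented limit comes from.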
If the number of columns is large (and the number of rows not so much so) you can indeed compute PCA. Just compute the SVD of your (mean-centered) transposed data matrix and multiply it by the resulting eigenvectors and the inverse of the diagonal matrix of eigenvalues. There's your orthogonal subspace.
Bottom line: if the spark.ml implementation follows the first approach every time, then the limitation should be the same. If they check the dimensions of the input dataset to decide whether they should go for the second approach, then you won't have problems dealing with large numbers of columns if the number of rows is small.
Regardless of that, the limit is imposed by how much memory your workers have, so perhaps they let users hit the ceiling by themselves, rather than suggesting a limitation that may not apply for some. That might be the reason why they decided not to mention the limitation in the new docs.
Update: The source code reveals that they do take the first approach every time, regardless of the dimensionality of the input. The actual limit is 65535, and at 10,000 they issue a warning.
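For completeness, a minimal sketch of the spark.ml PCA API on a toy DataFrame (this mirrors the documented usage pattern; it is not a workaround for the column limit discussed above):

    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pca-ml-example").getOrCreate()

    data = [(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
            (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),),
            (Vectors.dense([6.0, 7.0, 0.0, 8.0, 0.0]),)]
    df = spark.createDataFrame(data, ["features"])

    # project onto the top k=2 principal components
    pca = PCA(k=2, inputCol="features", outputCol="pca_features")
    model = pca.fit(df)
    model.transform(df).select("pca_features").show(truncate=False)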

Text Document Clustering - Non-Uniform Clusters

I have been trying to cluster a set of text documents. I have a sparse TF-IDF matrix with around 10k documents (a subset of a larger dataset), and I try to run the scikit-learn k-means algorithm with different numbers of clusters (10, 50, 100). All the other parameters are left at their default values.
I get very strange behavior: no matter how many clusters I specify, or even if I change the number of iterations, there is one cluster that contains most of the documents and many clusters that contain just one document. This is highly non-uniform behavior.
Does anyone know what kind of problem I am running into?
Here are the possible things that might be going "wrong":
Your k-means cluster initialization points are chosen as the same set of points in each run. I recommend using 'random' for the init parameter of k-means (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). If that doesn't work, then supply k-means with your own set of random initial cluster centers. Remember to seed your random generator; Python's random module (https://docs.python.org/2/library/random.html) uses the current date and time as the default seed.
Your distance function, i.e. Euclidean distance, might be the culprit. This is less likely, but it is always good to run k-means using cosine similarity, especially when you are using it for document similarity. scikit-learn doesn't have this functionality at present, but you should look here: Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
These two combined should give you good clusters.
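As a rough sketch of those two suggestions combined (random initialization, plus L2-normalized tf-idf rows so that Euclidean k-means behaves like cosine-based clustering), assuming documents is the list of raw texts:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import normalize

    # `documents` is assumed to be the list of ~10k raw text documents
    tfidf = TfidfVectorizer().fit_transform(documents)

    # on L2-normalized rows, squared Euclidean distance is 2 - 2*cosine,
    # so Euclidean k-means orders points the same way cosine distance would
    X = normalize(tfidf, norm="l2")

    km = KMeans(n_clusters=50, init="random", n_init=10)
    labels = km.fit_predict(X)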
I noticed, with the help of the above answers and comments, that there was a problem with outliers and noise in the original space. For this, we should use a dimensionality reduction method that eliminates the unwanted noise in the data. I tried random projections first, but they failed to work with the text data: the problem was still not solved.
Then, using Truncated Singular Value Decomposition, I was able to get nicely uniform clusters. Hence, Truncated SVD is the way to go with textual data, in my opinion.
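For illustration, a minimal LSA-style sketch of that approach, again assuming documents is the list of raw texts and the component/cluster counts are placeholders:

    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer

    # `documents` is assumed to be the list of raw text documents
    tfidf = TfidfVectorizer().fit_transform(documents)

    # LSA: reduce the sparse tf-idf space to ~100 dense components, then
    # re-normalize so k-means' Euclidean distance approximates cosine distance
    lsa = make_pipeline(TruncatedSVD(n_components=100), Normalizer(copy=False))
    X = lsa.fit_transform(tfidf)

    labels = KMeans(n_clusters=50).fit_predict(X)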

SUMMA (efficient distributed matrix multiplication) on Spark?

I'm trying to figure out if something like http://www.cs.utexas.edu/ftp/techreports/tr95-13.pdf is possible on Spark.
Is it possible to access low-level RDD functionality/distribution in the same kind of way as with MPI? (The key concepts in SUMMA are a 2D process topology and row/column broadcasts.)
I've seen simple matrix multiplication in Spark, but it doesn't seem to come close to SUMMA's efficiency.
Thanks!
My schoolmates and I have built a distributed matrix library on top of Spark: Marlin (https://github.com/PasaLab/marlin). The matrix multiplication algorithm implemented in our library is based on CARMA (http://www.eecs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf).
At first, we surveyed the SUMMA algorithm. However, sending submatrices along the processor rows and columns during each iteration is quite difficult to implement with Spark's API. Recently, we implemented a mechanism similar to MPI send and receive in Spark using TorrentBroadcast, which requires modifying the Spark core code. I think that with this strategy it is possible to implement SUMMA in Spark, but fault tolerance and scalability may be a problem.
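For reference, here is a minimal sketch of Spark's built-in grid-partitioned BlockMatrix multiplication, which is the kind of "simple matrix multiplication" the question refers to; it is not SUMMA and makes no claim of matching SUMMA's efficiency, but it shows the 2D block layout Spark already exposes:

    import numpy as np
    from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("blockmatrix-example").getOrCreate()
    sc = spark.sparkContext

    def random_matrix(n, m):
        # toy dense matrix distributed as indexed rows
        return IndexedRowMatrix(
            sc.parallelize([IndexedRow(i, np.random.rand(m)) for i in range(n)])
        )

    # grid-partitioned matrices; the block sizes define the 2D layout
    A = random_matrix(100, 80).toBlockMatrix(rowsPerBlock=32, colsPerBlock=32)
    B = random_matrix(80, 60).toBlockMatrix(rowsPerBlock=32, colsPerBlock=32)

    C = A.multiply(B)  # distributed block-wise multiplication
    print(C.numRows(), C.numCols())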
