How to get the partition results of machine learning in spark - apache-spark

I know that when I execute the LogisticRegression.fit(), this operation will split the data automatically in spark MLlib.
My first question :
For example: I want to split the data to 10 sets. Then, they are run a learning algorithm parallel. Then, I can get ten hypotheses, which can do the other algorithm to find a final better hypothesis.
The idea is that I split the data 10 sets randomly. Then I run LogisticRegression.fit() in each fold (for-loop?). Therefore, I can the each fold's results. However, I think it may not be a good way because of spending more time.
What is the better way when I want to get the each partition's results?
The other question is:
In LogisticRegressionModel class, can I say "the val weights is the hypothesis of the model." ?

Related

Can we refit or fit in in parts clustering algorithms?

I want to cluster big data set (more than 1M records).
I want to use dbscan or hdbscan algorithms for this clustering task.
When I try to use one of those algorithms, I'm getting memory error.
Is there a way to fit big data set in parts ? (go with for loop and refit every 1000 records) ?
If no, is there a better way to cluster big data set, without upgrading the machine memory ?
If the number of features in your dataset is not too much (below 20-25), you can consider using BIRCH. It's an iterative method that can be used for large datasets. In each iteration it builds a tree with only a small sample of data and put each instance into clusters.

Clustering with unknown number of clusters in Spark

I have a very large dataset (about 3.5M) of text messages. I am using tf-idf vector to represent each message in this dataset. I want to cluster the messages of the same topic together and I don't know the actual clusters or even the number of them.
So I searched a little and found that Optics, DBSCAN, or HDBSCAN can do this job but there is no implementation of them is spark ml or mllib. according to this In spark mllib there are implementations of K-means, Gaussian mixture, Power iteration clustering (PIC), Latent Dirichlet allocation (LDA), Bisecting k-means and Streaming k-means.
So my problem is, all of them need K as an input and I don't have it. Is there any clustering algorithm implemented in Spark that find the number of clusters on its own?
Got a little bit too long for a comment. I'll try to explain it here.
Do you have the data on which topic a message belongs? Then you can simply do a group by that topic to group all the messages with similar topics.
That's one thing. And if you are trying to derive topics (K) from the dataset itself, then you need little more statistics to build a sound feature set to cluster them. Then you can come to a conclusion on K by varying it and find the best K with minimal error. There is a famous method called elbow method.
Check this out. https://www.geeksforgeeks.org/ml-determine-the-optimal-value-of-k-in-k-means-clustering/#:~:text=There%20is%20a%20popular%20method,fewer%20elements%20in%20the%20cluster.

Efficient implementation of SOM (Self organizing map) on Pyspark

I am struggling with the implementation of a performant version of a SOM Batch algorithm on Spark / Pyspark for a huge dataset with > 100 features.
I have the feeling that I can either use RDDs where I can/have to specifiy the parallization on my own or I use Dataframe which should be more performant but I see no way how to use something like a local accumulation variable for each worker when using dataframes.
Ideas:
Using Accumulators. Parallelize the calculations by creating a UDF which takes the observations as input, calculates the impacts on the net and sends the impacts to an accumulator in the driver. (Implemented this version already, but seems rather slow (I think accumulator updates take to long))
Store results in a new column of Dataframe and then sum it together in the end. (Would have to store a whole neural net in the each row (e.g. 20*20*130) tho) Are spark optimization algorithms realizing, that it does not need to save each net but only sum them together?
Create an custom parallized algorithms using RDDs similar to that: https://machinelearningnepal.com/2018/01/22/apache-spark-implementation-of-som-batch-algorithm/ (but with more performant calculation algorithms). But I would have to use some kind of loop to loop over each row and update the net -> sounds like that would be rather unperformant.)
Any thoughts on the different options? Is there an even better option?
Or are all ideas not that good and I should just preselect a maximum variety subset of my dataset and train a SOM locally on that.
Thanks!
This is exactly what I have done last year, so I might be in a good position to give you an answer.
First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in Pyspark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:
they did not really implement the batch SOM algorithm, but used a map averaging method that gave me strange results (abnormal symmetries in the output map)
they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit()/transform() API operating over DataFrames.
So, there I went on to code it myself: the batch SOM algorithm in Spark ML style. The first thing I did was looking how k-means was implemented in Spark ML, because as you know, the batch SOM is very similar to the k-means algorithm. Actually, I could re-use a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
I can summarize quickly how the model is built:
A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
A SOM class, which inherits from spark's Estimator, and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame, where features are stored as a spark.ml.linalg.Vector in a single column. fit() will then select this column and unpack the DataFrame to obtain the unerlying RDD[Vector] of features, and call the run() method on it. This is where all the computations happen, and as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
SOMModel is a trained SOM model, and inherits from spark's Transformer/Model. It contains the map prototypes (center vectors), and contains a transform() method that can operate on DataFrames by taking an input feature column, and adding a new column with the predictions (projection on the map). This is done by a prediction UDF.
There is also SOMTrainingSummary that collects stuff such as the objective function.
Here are the take-aways:
There is not really an opposition between RDD and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as a RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and an optimization of the execution plan (Catalyst optimizer).
For structured data, select/filter/aggregation operations, DO USE Dataframes, always.
...but for more complex tasks such as a machine learning algorithm, you NEED to come back to the RDD API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey/and so son. Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations!
Hope it will solve your question. Concerning performance, as you asked for an efficient implementation, I did not make any benchmarks yet but I use it at work and it crunches 500k/1M-rows datasets in a couple of minutes on the production cluster.

Get Spark metrics on each iteration step?

Applying spark's logistic regression on a specific dataset requires to define a number of iterations. So far I've learned that outputting the result of the cost function on each iteration might be useful information to plot. It can be used to visualize how many iterations a function needs to converge to a minimum. I was wondering if there is a way to output such information in spark? Looping over a train() function with different iteration numbers, sounds like a solution that requires a lot of time on large datasets. It would be nice to know if there is a better one already built in. Thanks for any advice on this topic.
After you've trained a model (call it myModel) that has such a history, you can get the iteration-by-iteration history with
myModel.summary.objectiveHistory.foreach(...)
There's a nice example here in the Spark ML documentation -- once you know the right search terms.

Spark mllib shuffling the data

Does spark mllib package shuffle the data. I have been using randomSplit on the data, however, looking at the splits it looks like that it has the same order.
Is there a way to shuffle data before splitting it?
I think that you are confusing actual data shuffling with the random seed when splitting. If you set your split seed to a constant, let's say 11L per example, you'll always get the same splits.
And as stated by #zero323 Mllib simply takes a random sample by traversing each partition.
Is there a way to shuffle data before splitting it?
It depends on a context. You can always repartition or sort by random value but it is
Expensive
Requires some effort to avoid caching if you want to get different result each time
It is harder to get reproducible sample if you need one.
Thus my approach is to iterate and yield on the split seed. Which is the main principle of cross-validation. This way you can get the best seed according to evaluation step you are performing. And you have your reproducible sample, but this approach is quite expensive.
I hope this helps.

Resources