I want to implement SimRank using spark rdd interface. But my dataset is too large to process that the bipartite graph has hundreds of millions of nodes, so to find the similarity score of all neighborhood pairs is computationally expensive. I try to find some existing implementations but they all seems not to be scalable. Any suggestions?
I suggest to first take a look on the GraphX and Graphframes libraries that comes with the Apache Spark ecosystem and see if those fits your needs. They mostly bring in graph processing support on the top of RDD and Dataframes.
Related
I have implemented a classification algorithm in Spark that involves calculating distances between instances. The implementation uses dataframes (and raw SQL where possible). I transform the features of the instances into a vector so I can apply a Scaler and to end up with a uniform schema regardless of how many features my dataset happens to have.
As far as I understand, Spark SQL can't do calculations with vector columns. So in order to calculate the distance between instances, I've had to define a python function and register it as a UDF. But I see warnings against using UDFs because the dataframe engine "can't optimise UDFs".
My questions are:
Is it correct that there is no way to calculate the distance between two feature vectors within SQL (not using a UDF)?
Can the use of a UDF to calculate the distance between vectors have a large impact on performance, or is there nothing for Spark to optimise here anyway?
Is there some other consideration I've missed?
To be clear, I'm hoping the answer is either
"You're doing it wrong, this is indeed inefficient, here's how to do it instead: ...", or
"UDFs are not intrinsically inefficient, this is a perfectly good use for them and there's no opimisation you're missing out on"
UDF are not efficient nor optimized, and are not transferred to jvm code especially if you use PySpark, there is pickle object created, OS spent lots of resources to transfer from jvm in/out. I have implemented something in pyspark using udf for geolocation and it would never finish in a few days on the other hand implemented in scala it has finished in a few hours.
Do it in scala if you have to do it.
Maybe that can help
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
I am struggling with the implementation of a performant version of a SOM Batch algorithm on Spark / Pyspark for a huge dataset with > 100 features.
I have the feeling that I can either use RDDs where I can/have to specifiy the parallization on my own or I use Dataframe which should be more performant but I see no way how to use something like a local accumulation variable for each worker when using dataframes.
Ideas:
Using Accumulators. Parallelize the calculations by creating a UDF which takes the observations as input, calculates the impacts on the net and sends the impacts to an accumulator in the driver. (Implemented this version already, but seems rather slow (I think accumulator updates take to long))
Store results in a new column of Dataframe and then sum it together in the end. (Would have to store a whole neural net in the each row (e.g. 20*20*130) tho) Are spark optimization algorithms realizing, that it does not need to save each net but only sum them together?
Create an custom parallized algorithms using RDDs similar to that: https://machinelearningnepal.com/2018/01/22/apache-spark-implementation-of-som-batch-algorithm/ (but with more performant calculation algorithms). But I would have to use some kind of loop to loop over each row and update the net -> sounds like that would be rather unperformant.)
Any thoughts on the different options? Is there an even better option?
Or are all ideas not that good and I should just preselect a maximum variety subset of my dataset and train a SOM locally on that.
Thanks!
This is exactly what I have done last year, so I might be in a good position to give you an answer.
First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in Pyspark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:
they did not really implement the batch SOM algorithm, but used a map averaging method that gave me strange results (abnormal symmetries in the output map)
they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit()/transform() API operating over DataFrames.
So, there I went on to code it myself: the batch SOM algorithm in Spark ML style. The first thing I did was looking how k-means was implemented in Spark ML, because as you know, the batch SOM is very similar to the k-means algorithm. Actually, I could re-use a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
I can summarize quickly how the model is built:
A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
A SOM class, which inherits from spark's Estimator, and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame, where features are stored as a spark.ml.linalg.Vector in a single column. fit() will then select this column and unpack the DataFrame to obtain the unerlying RDD[Vector] of features, and call the run() method on it. This is where all the computations happen, and as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
SOMModel is a trained SOM model, and inherits from spark's Transformer/Model. It contains the map prototypes (center vectors), and contains a transform() method that can operate on DataFrames by taking an input feature column, and adding a new column with the predictions (projection on the map). This is done by a prediction UDF.
There is also SOMTrainingSummary that collects stuff such as the objective function.
Here are the take-aways:
There is not really an opposition between RDD and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as a RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and an optimization of the execution plan (Catalyst optimizer).
For structured data, select/filter/aggregation operations, DO USE Dataframes, always.
...but for more complex tasks such as a machine learning algorithm, you NEED to come back to the RDD API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey/and so son. Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations!
Hope it will solve your question. Concerning performance, as you asked for an efficient implementation, I did not make any benchmarks yet but I use it at work and it crunches 500k/1M-rows datasets in a couple of minutes on the production cluster.
Explain the connection between libraries, such as SparkSQL, MLib, GraphX and Spark Streaming,and the core Spark platform
Basically, Spark is the base, an engine that allows the large-scale data processing with high performance. It provides an interface for programming with implicit data parallelism and fault tolerance.
GraphX, MLlib, Spark Streaming and Spark SQL are modules built on top of this engine, each of this has a different goal. Each of these libraries has new objects and functions that provide support for certain types of structures or features.
For example:
GraphX is a distributed graph processing module which allows representing a graph and applies efficient transformations, partitions and algorithms specialized for this kind of structure.
MLlib is a distributed machine learning module on top of Spark which implements certain algorithms like classification, regression, clustering,...
Spark SQL introduce the notion of DataFrames, the most important structure in this module, which allows applying SQL operations (e.g. select, where, groupBy, ...)
Spark Streaming is an extension of the core Spark which ingests data in mini-batches and performs transformations on those mini-batches of data. Spark Streaming has support built-in to consume from Kafka, Flume, and others platforms
You can combine these modules according to your need. For example, if you want to process a large graph for applying a clustering algorithm, then you can use the representation provided by GraphX and use MLlib for apply K-means on this representation.
Doc
I'm developing a Scala-based extreme learning machine, in Apache Spark. My model has to be a Spark Estimator and use the Spark framework in order to fit into the machine learning pipeline. Does anyone know if Breeze can be used in tandem with Spark? All of my data is in Spark data frames and conceivably I could import it using Breeze, use Breeze DenseVectors as the data structure then convert to a DataFrame for the Estimator part. The advantage of Breeze is that it has a function pinv for the Moore-Penrose pseudo-inverse, which is an inverse for a non-square matrix. There is no equivalent function in the Spark MLlib, as far as I can see. I have no idea whether it's possible to convert Breeze tensors to Spark DataFrames so if anyone has experience of this it would be really useful. Thanks!
Breeze can be used with Spark. In fact is used internally for many MLLib functions, but required conversions are not exposed as public. You can add your own conversions and use Breeze to process individual records.
For example for Vectors you can find conversion code:
SparseVector.asBreeze
DenseVector.asBreeze
Vector.fromBreeze
For Matrices please see asBreeze / fromBreeze in Matrices.scala
It cannot however, be used on distributed data structures. Breeze objects use low level libraries, which cannot be used for distributed processing. Therefore DataFrame - Breeze objects conversions are possible only if you collect data to the driver and are limited to the scenarios where data can be stored in the driver memory.
There exist other libraries, like SysteML, which integrate with Spark and provide more comprehensive linear algebra routines on distributed objects.
I'm new to Spark (and to cluster computing framework) and I'm wondering about the general principles followed by the parallel algorithms used for machine learning (MLlib). Are they essentially faster because Spark distributes training data over multiple nodes? If yes, I suppose that all nodes share the same set of parameters right? And that they have to combine (ex: summing) the intermediate calculations (ex: the gradients) on a regular basis, am I wrong?
Secondly, suppose I want to fit my data with an ensemble of models (ex: 10). Wouldn't it be simpler in this particular context to run my good old machine-learning program independently on 10 machines instead of having to write complicated code (for me at least!) for training in a Spark cluster?
Corollary question: is Spark (or other cluster computing framework) useful only for big data applications for which we could not afford training more than one model and for which training time would be too much long on a single machine?
You correct about the general principle. Typical MLlib algorithm is a an iterative procedure with local phase and data exchange.
MLlib algorithms are not necessarily faster. They try to solve two problems:
disk latency.
memory limitations on a single machine.
If you can process data on a single node this can be orders of magnitude faster than using ML / MLlib.
The last question is hard to answer but:
It is not complicated to train ensembles:
def train_model(iter):
items = np.array(list(iter))
model = ...
return model
rdd.mapPartitions(train_model)
There are projects which already do that (https://github.com/databricks/spark-sklearn)