Spark intersection implementation - apache-spark

How does Spark implement the intersection method? Does it require the two RDDs to be colocated on a single machine?
From here it says that it uses hash tables, which seems a bit odd since that is probably not scalable; sorting both RDDs and then comparing them item by item might have provided a more scalable solution.
Any thoughts on the subject are welcome.

It definitely doesn't need the RDDs to be colocated on a single machine. You can just look at the code for the details; it looks like it uses a cogroup.
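
For illustration, here is a minimal PySpark sketch of a cogroup-based intersection, roughly the shape of the approach mentioned above: each element is keyed by itself, the two keyed RDDs are cogrouped (a shuffled, fully distributed operation), and only keys present on both sides are kept. The function and variable names are just for the example.

    from pyspark import SparkContext

    sc = SparkContext(appName="intersection-sketch")

    def intersection_via_cogroup(rdd_a, rdd_b):
        # Key every element by itself, cogroup the two keyed RDDs (a shuffle
        # across the cluster, not a single-machine operation), and keep only
        # keys that have at least one occurrence on both sides.
        return (rdd_a.map(lambda v: (v, None))
                     .cogroup(rdd_b.map(lambda v: (v, None)))
                     .filter(lambda kv: len(kv[1][0]) > 0 and len(kv[1][1]) > 0)
                     .keys())

    a = sc.parallelize([1, 2, 3, 4])
    b = sc.parallelize([3, 4, 5])
    print(sorted(intersection_via_cogroup(a, b).collect()))  # [3, 4]

Because the cogroup shuffles by key hash, matching elements end up in the same partition without either RDD ever having to fit on one machine.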

Related

Are there patterns for "lookbehind filters" in Apache Spark?

I have stumbled on a couple of workloads which seem to require filtering data with a "lookback" capability, mainly in IoT scenarios where sensors can produce garbage data, and detecting that requires looking at the previous record from the same sensor.
Spark's filter() operation is obviously "element-only"; in fact, the RDD as a whole can't know the order of the elements you want it to look behind on. So another approach is needed.
My naive approach would involve keying the RDD by sensor, repartitioning it so that keys and partitions coincide, and sorting all elements within each key/partition so that they are in temporal order. Then I could filter with a user function and emit the surviving data so that the rest of the pipeline can deal with it as it wishes.
However, this looks heavyweight and likely inefficient. Is there a more idiomatic way?
Summary: Is there a Spark-related design pattern for filtering tasks that need to "look back" at the previous element of a sequence?
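
For what it's worth, here is a minimal PySpark sketch of the keyed repartition-and-sort approach described in the question, using repartitionAndSortWithinPartitions so that each sensor's readings arrive in temporal order inside a single partition. The record layout, the looks_valid check, and all the names are hypothetical.

    from pyspark import SparkContext

    sc = SparkContext(appName="lookbehind-sketch")

    # Hypothetical input: (sensor_id, timestamp, value) records.
    readings = sc.parallelize([("s1", 1, 10.0), ("s1", 2, 9.8),
                               ("s2", 1, 5.0), ("s2", 2, 500.0)])

    def looks_valid(value, prev_value):
        # Hypothetical user check: reject readings that jump too far
        # from the previous reading of the same sensor.
        return prev_value is None or abs(value - prev_value) < 100.0

    def drop_garbage(records):
        # Records arrive sorted by (sensor_id, timestamp) within the partition,
        # so the previous record for the same sensor is simply the previous element.
        prev_sensor, prev_value = None, None
        for (sensor_id, timestamp), value in records:
            if prev_sensor != sensor_id:
                prev_value = None
            if looks_valid(value, prev_value):
                yield sensor_id, timestamp, value
            prev_sensor, prev_value = sensor_id, value

    num_parts = 4  # illustrative
    filtered = (readings
                .map(lambda r: ((r[0], r[1]), r[2]))          # key by (sensor, time)
                .repartitionAndSortWithinPartitions(
                    numPartitions=num_parts,
                    partitionFunc=lambda key: hash(key[0]))   # partition by sensor only
                .mapPartitions(drop_garbage))
    print(filtered.collect())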

pySpark: is it possible to groupBy() with one single node per group?

I'm using pySpark to compute per-group matrices. It looks like the computation would be faster if Spark stored any given group's rows on a single node, so that Spark could compute each matrix locally. I'm afraid inter-node cooperation could take much longer.
Do map() and groupBy() usually achieve this kind of locality? Should I try to specify it as an option, if that's possible?
NB: the matrices involve computing a distance between each row and the previous one, within each (sorted) group.
It seems Spark will do that by default.
See here: http://backtobazics.com/big-data/spark/apache-spark-groupby-example/
I think what you're asking for is mapPartitions(). The operation then happens locally within each partition.
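
As a rough illustration of the first answer: groupByKey() does gather all of a key's values into a single task, so a per-group computation written over those values runs locally on one node (at the cost of holding the whole group in memory). A minimal sketch with illustrative names and data, computing the distance from each row to the previous one within a sorted group:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="per-group-sketch")

    # Hypothetical input: (group_id, (timestamp, feature_vector)) rows.
    rows = sc.parallelize([
        ("g1", (1, [0.0, 0.0])), ("g1", (2, [3.0, 4.0])),
        ("g2", (1, [1.0, 1.0])), ("g2", (2, [1.0, 2.0])),
    ])

    def distances_to_previous(values):
        # All of a group's rows arrive in a single task, so this runs locally.
        ordered = sorted(values, key=lambda tv: tv[0])           # sort by timestamp
        mat = np.array([vec for _, vec in ordered])
        return np.linalg.norm(np.diff(mat, axis=0), axis=1)      # distance to the previous row

    per_group = rows.groupByKey().mapValues(distances_to_previous)
    print(per_group.collectAsMap())   # e.g. {'g1': array([5.]), 'g2': array([1.])}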

Spark broadcast variables: large maps

I am broadcasting a large map (~6-10 GB) using sc.broadcast(prod_rdd). However, I am not sure whether broadcasting is meant only for small data/files rather than for larger objects like mine. If it is the former, what is the recommended practice? One option is to use a NoSQL database and do the lookups against that. One issue there is that I might have to give up performance, since I would be going through a single node (a region server, or whatever the equivalent is). If anyone has any insight into the performance impact of these design choices, that would be greatly appreciated.
I'm wondering if you could perhaps use mapPartitions and read the map once per partition rather than broadcasting it?
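
A minimal sketch of that idea, assuming the map can be loaded from shared storage; load_prod_map, the record layout, and the names are all hypothetical:

    from pyspark import SparkContext

    sc = SparkContext(appName="per-partition-map-sketch")

    def load_prod_map():
        # Hypothetical loader: in practice this would read the 6-10 GB map from
        # shared storage (HDFS, a key-value store, ...) rather than build it inline.
        return {"p1": 9.99, "p2": 4.50}

    def join_with_prod_map(records):
        prod_map = load_prod_map()        # loaded once per partition, not once per record
        for record in records:
            yield record, prod_map.get(record["product_id"])

    events = sc.parallelize([{"product_id": "p1"}, {"product_id": "p2"}])
    enriched = events.mapPartitions(join_with_prod_map)
    print(enriched.collect())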

Is Spark still advantageous for non-iterative analytics?

Spark uses in-memory computing and caching to decrease latency on complex analytics; however, this is mainly for "iterative algorithms".
If I needed to perform a more basic analytic, say each element was a group of numbers and I wanted to look for elements with a standard deviation less than 'x', would Spark still decrease latency compared to regular cluster computing (without in-memory computing)? Assume I used the same commodity hardware in each case.
It tied for the top sorting framework while using none of those extra mechanisms, so I would argue that is reason enough. But you can also run streaming, graph processing, or machine learning without having to switch gears. Add in that you should use DataFrames wherever possible, and you get query optimizations beyond any other framework that I know of. So yes, Spark is the clear choice in almost every instance.
One good thing about Spark is its Data Source API; combining it with Spark SQL gives you the ability to query and join different data sources together. Spark SQL now includes a decent optimizer, Catalyst. As mentioned in another answer, alongside the core (RDD) API you can also process streaming data, apply machine learning models, and run graph algorithms. So yes.
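
For the concrete example in the question (keeping groups whose standard deviation is below a threshold), the DataFrame API mentioned above makes this a short aggregation; the column names and threshold below are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("stddev-filter-sketch").getOrCreate()

    # Hypothetical input: one value per row, tagged with its group.
    df = spark.createDataFrame(
        [("g1", 1.0), ("g1", 1.1), ("g2", 1.0), ("g2", 10.0)],
        ["group_id", "value"])

    threshold = 0.5
    result = (df.groupBy("group_id")
                .agg(F.stddev("value").alias("sd"))    # sample standard deviation per group
                .filter(F.col("sd") < threshold))      # keep groups below the threshold
    result.show()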

How to score all user-product combinations in Spark MatrixFactorizationModel?

Given a MatrixFactorizationModel what would be the most efficient way to return the full matrix of user-product predictions (in practice, filtered by some threshold to maintain sparsity)?
Via the current API, one could pass the cartesian product of users and products to the predict function, but it seems to me that this would do a lot of extra processing.
Would accessing the private userFeatures, productFeatures be the correct approach, and if so, is there a good way to take advantage of other aspects of the framework to distribute this computation in an efficient way? Specifically, is there an easy way to do better than multiplying all pairs of userFeature, productFeature "by hand"?
Spark 1.1 has a recommendProducts method that can be mapped to each user ID. This is better than nothing but not really optimized for recommending to all users.
I would double-check that you really mean to make recommendations for everyone; at scale, this is inherently a big, slow operation. Consider predicting only for users that have been recently active.
Otherwise, yes, your best bet is to create your own method. The cartesian join of the feature RDDs is probably too slow, as it shuffles so many copies of the feature vectors. Choose the larger of the user/product feature sets and map over that, holding the other feature set in memory on each worker. If that isn't feasible, you can make this more complex and map several times against subsets of the smaller RDD held in memory.
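
A minimal sketch of that approach, broadcasting the (usually smaller) product factors and mapping over the user factors; the tiny ALS model, the threshold, and the variable names are only illustrative:

    import numpy as np
    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="score-all-sketch")

    # Tiny illustrative model; in practice `model` is your trained MatrixFactorizationModel.
    ratings = sc.parallelize([Rating(1, 10, 5.0), Rating(1, 20, 1.0), Rating(2, 20, 4.0)])
    model = ALS.train(ratings, rank=5, iterations=5)

    # Collect and broadcast the product factors.
    pf = model.productFeatures().collect()            # [(product_id, factors), ...]
    bc_ids = sc.broadcast(np.array([p for p, _ in pf]))
    bc_mat = sc.broadcast(np.array([list(f) for _, f in pf]))

    threshold = 0.5                                   # illustrative sparsity cut-off

    def score_user(user_and_factors):
        user_id, factors = user_and_factors
        scores = bc_mat.value.dot(np.array(factors))  # score every product for this user
        keep = scores > threshold                     # keep only scores above the threshold
        return [(user_id, int(p), float(s))
                for p, s in zip(bc_ids.value[keep], scores[keep])]

    predictions = model.userFeatures().flatMap(score_user)
    print(predictions.take(5))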
As of Spark 2.2, recommendProductsForUsers(num) would be the method.
Recommends the top "num" number of products for all users. The number of recommendations returned per user may be less than "num".
https://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html
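
Reusing the model variable from the sketch above, the call looks roughly like this:

    # Top-10 recommendations for every user in one call (Spark 2.2 MLlib API).
    top_10 = model.recommendProductsForUsers(10)   # RDD of (user_id, sequence of Rating)
    print(top_10.take(1))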
