Spark flatMapToPair vs [filter + mapToPair] - apache-spark

What is the performance difference between the blocks of code below?
1.FlatMapToPair: This code block uses a single transformation, but is basically having the filter condition inside of it which returns an empty list, technically not allowing this element in the RDD to progress along
rdd.flatMapToPair(
if ( <condition> )
return Lists.newArrayList();
return Lists.newArrayList(new Tuple2<>(key, element));
)
2.[Filter + MapToPair] This code block has two transformations where the first transformation simply filters using the same condition as the above block of code but does another transformation mapToPair after the filter.
rdd.filter(
(element) -> <condition>
).mapToPair(
(element) -> new Tuple2<>(key, element)
)
Is Spark intelligent enough to perform the same with both these blocks of code regardless of the number of transformation OR perform worse in the code block 2 as these are two transformations?
Thanks

Actually Spark will perform worse in the first case because it has to initialize and then garbage collect new ArrayList for each record. Over a large number of records it can add substantial overhead.
Otherwise Spark is "intelligent enough" to use lazy data structures and combines multiple transformations which don't require shuffles into a single stage.
There are some situations where explicit merging of different transformations is beneficial (either to reduce number of initialized objects or to keep shorter lineage) but this is not one of these.

Related

Forcing pyspark join to occur sooner

PROBLEM: I have two tables that are vastly different in size. I want to join on some id by doing a left-outer join. Unfortunately, for some reason even after caching my actions after the join are being executed on all records even though I only want the ones from the left table. See below:
MY QUESTIONS:
1. How can I set this up so only the records that match the left table get processed through the costly wrangling steps?
LARGE_TABLE => ~900M records
SMALL_TABLE => 500K records
CODE:
combined = SMALL_TABLE.join(LARGE_TABLE SMALL_TABLE.id==LARGE_TABLE.id, 'left-outer')
print(combined.count())
...
...
# EXPENSIVE STUFF!
w = Window().partitionBy("id").orderBy(col("date_time"))
data = data.withColumn('diff_id_flag', when(lag('id').over(w) != col('id'), lit(1)).otherwise(lit(0)))
Unfortunately, my execution plan shows the expensive transformation operation above is being done on ~900M records. I find this odd since I ran df.count() to force the join to execute eagerly rather than lazily.
Any Ideas?
ADDITIONAL INFORMATION:
- note that the expensive transformation in my code flow occurs after the join (at least that is how I interpret it) but my DAG shows the expensive transformation occurring as a part of the join. This is exactly what I want to avoid as the transformation is expensive. I want the join to execute and THEN the result of that join to be run through the expensive transformation.
- Assume the smaller table CANNOT fit into memory.
The best way to do this is to broadcast the tiny dataframe. Caching is good for multiple actions, which doesnt seem to be applicable ro your particular use case.
df.count has no effect on the execution plan at all. It is just expensive operation executed without any good reason.
Window function application in this requires the same logic as join. Because you join by id and partitionBy idboth stages will require the same hash partitioning and full data scan for both sides. There is no acceptable reason to separate these two.
In practice join logic should be applied before window, serving as a filter for the the downstream transformations in the same stage.

'take' that transforms RDD

take(count) is an action on RDD, which returns an Array with first count items.
Is there a transformation that returns a RDD with first count items? (It is ok if count is approximate)
The best I can get is
val countPerPartition = count / rdd.getNumPartitions.toDouble
rdd.mapPartitions(_.take(countPerPartition))
Update:
I do not want data to be transfered to the driver. In my case, count may be quite large, and driver has not enough memory to hold it. I want the data to remain paralellized for further transformations.
Why not to rdd.map(..).take(X). I.e. transform and then take. Don't be afraid to do redundant work, until you call take all the computations are lazy evaluated in spark(so only ~X transformations will happen)

Understanding shuffle managers in Spark

Let me help to clarify about shuffle in depth and how Spark uses shuffle managers. I report some very helpful resources:
https://trongkhoanguyenblog.wordpress.com/
https://0x0fff.com/spark-architecture-shuffle/
https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md
Reading them, I understood there are different shuffle managers. I want to focus about two of them: hash manager and sort manager(which is the default manager).
For expose my question, I want to start from a very common transformation:
val rdd = reduceByKey(_ + _)
This transformation causes map-side aggregation and then shuffle for bringing all the same keys into the same partition.
My questions are:
Is Map-Side aggregation implemented using internally a mapPartition transformation and thus aggregating all the same keys using the combiner function or is it implemented with a AppendOnlyMap or ExternalAppendOnlyMap?
If AppendOnlyMap or ExternalAppendOnlyMap maps are used for aggregating, are they used also for reduce side aggregation that happens into the ResultTask?
What exaclty the purpose about these two kind of maps (AppendOnlyMap or ExternalAppendOnlyMap)?
Are AppendOnlyMap or ExternalAppendOnlyMap used from all shuffle managers or just from the sortManager?
I read that after AppendOnlyMap or ExternalAppendOnlyMap are full, are spilled into a file, how exactly does this steps happen?
Using the Sort shuffle manager, we use an appendOnlyMap for aggregating and combine partition records, right? Then when execution memory is fill up, we start sorting map, spilling it to disk and then clean up the map, my question is : what is the difference between spill to disk and shuffle write? They consist basically in creating file on local file system, but they are treat differently, Shuffle write records, are not put into the appendOnlyMap.
Can you explain in depth what happen when reduceByKey being executed, explaining me all the steps involved for to accomplish that? Like for example all the steps for map side aggregation, shuffling and so on.
It follows the description of reduceByKey step-by-step:
reduceByKey calls combineByKeyWithTag, with identity combiner and identical merge value and create value
combineByKeyWithClassTag creates an Aggregator and returns ShuffledRDD. Both "map" and "reduce" side aggregations use internal mechanism and don't utilize mapPartitions.
Agregator uses ExternalAppendOnlyMap for both combineValuesByKey ("map side reduction") and combineCombinersByKey ("reduce side reduction")
Both methods use ExternalAppendOnlyMap.insertAllMethod
ExternalAppendOnlyMap keeps track of spilled parts and the current in-memory map (SizeTrackingAppendOnlyMap)
insertAll method updates in-memory map and checks on insert if size estimated size of the current map exceeds the threshold. It uses inherited Spillable.maybeSpill method. If threshold is exceeded this method calls spill as a side effect, and insertAll initializes clean SizeTrackingAppendOnlyMap
spill calls spillMemoryIteratorToDisk which gets DiskBlockObjectWriter object from the block manager.
insertAll steps are applied for both map and reduce side aggregations with corresponding Aggregator functions with shuffle stage in between.
As of Spark 2.0 there is only sort based manager: SPARK-14667

Spark data manipulation with wholeTextFiles

I have 20k compressed files of ~2MB to manipulate in spark. My initial idea was to use wholeTextFiles() so that I get filename - > content tuples. This is useful because I need to maintain this kind of pairing (because the processing is done on a per file basis, with each file representing a minute of gathered data). However, whenever I need to map/filter/etc the data and to maintain this filename - > association, the code gets ugly (and perhaps not efficient?) i.e.
Data.map(lambda (x,y) : (x, y.changeSomehow))
The data itself, so the content of each file, would be nice to read as a separate RDD because it contains 10k's of lines of data; however, one cannot have an rdd of rdds (as far as i know).
Is there any way to ease the process? Any workaround that would basically allow me to use the content of each file as an rdd, hence allowing me to do rdd.map(lambda x: change(x)) without the ugly keeping track of filename (and usage of list comprehensions instead of transformations) ?
The goal of course is to also maintain the distributed approach and to not inhibit it in any way.
The last step of the processing will be to gather together everything through a reduce.
More background: trying to identify (near) ship collisions on a per minute basis, then plot their path
If you have normal map functions (o1->o2), you can use mapValues function. You've got also flatMap (o1 -> Collection()) function: flatMapValues.
It will keep Key (in your case - file name) and change only values.
For example:
rdd = sc.wholeTextFiles (...)
# RDD of i.e. one pair, /test/file.txt -> Apache Spark
rddMapped = rdd.mapValues (lambda x: veryImportantDataOf(x))
# result: one pair: /test/file.txt -> Spark
Using reduceByKey you can reduce results

How does the filter operation of Spark work on GraphX edges?

I'm very new to Spark and don't really know the basics, I just jumped into it to solve a problem. The solution for the problem involves making a graph (using GraphX) where edges have a string attribute. A user may wish to query this graph and I handle the queries by filtering out only those edges that have the string attribute which is equal to the user's query.
Now, my graph has more than 16 million edges; it takes more than 10 minutes to create the graph when I'm using all 8 cores of my computer. However, when I query this graph (like I mentioned above), I get the results instantaneously (to my pleasant surprise).
So, my question is, how exactly does the filter operation search for my queried edges? Does it look at them iteratively? Are the edges being searched for on multiple cores and it just seems very fast? Or is there some sort of hashing involved?
Here is an example of how I'm using filter: Mygraph.edges.filter(_.attr(0).equals("cat")) which means that I want to retrieve edges that have the attribute "cat" in them. How are the edges being searched?
How can the filter results be instantaneous?
Running your statement returns so fast because it doesn't actually perform the filtering. Spark uses lazy evaluation: it doesn't actually perform transformations until you perform an action which actually gathers the results. Calling a transformation method, like filter just creates a new RDD that represents this transformation and its result. You will have to perform an action like collect or count to actually have it executed:
def myGraph: Graph = ???
// No filtering actually happens yet here, the results aren't needed yet so Spark is lazy and doesn't do anything
val filteredEdges = myGraph.edges.filter()
// Counting how many edges are left requires the results to actually be instantiated, so this fires off the actual filtering
println(filteredEdges.count)
// Actually gathering all results also requires the filtering to be done
val collectedFilteredEdges = filteredEdges.collect
Note that in these examples the filter results are not stored in between: due to the laziness the filtering is repeated for both actions. To prevent that duplication, you should look into Spark's caching functionality, after reading up on the details on transformations and actions and what Spark actually does behind the scene: https://spark.apache.org/docs/latest/programming-guide.html#rdd-operations.
How exactly does the filter operation search for my queried edges (when I execute an action)?
in Spark GraphX the edges are stored in a an RDD of type EdgeRDD[ED] where ED is the type of your edge attribute, in your case String. This special RDD does some special optimizations in the background, but for your purposes it behaves like its superclass RDD[Edge[ED]] and filtering occurs like filtering any RDD: it will iterate through all items, applying the given predicate to each. An RDD however is split into a number of partitions and Spark will filter multiple partitions in parallel; in your case where you seem to run Spark locally it will do as many in parallel as the number of cores you have, or how much you have specified explicitly with --master local[4] for instance.
The RDD with edges is partitioned based on the PartitionStrategy that is set, for instance if you create your graph with Graph.fromEdgeTuples or by calling partitionBy on your graph. All strategies are based on the edge's vertices however, so don't have any knowledge about your attribute, and so don't affect your filtering operation, except maybe for some unbalanced network load if you'd run it on a cluster, all 'cat' edges end up in the same partition/executor and you do a collect or some shuffle operation. See the GraphX docs on Vertex and Edge RDDs for a bit more information on how graphs are represented and partitioned.

Resources