I was going through a review and it was stated:
forEach forces all of the data to be sent to a single process (the Driver)
which will cause issues (such as OutOfMemory issues) at scale. Instead the
map() function serves the same purpose and distributes processing across
different Worker nodes in the cluster.
Is this correct? I could not find any document which says foreach is not distributed while map is distributed.
I believe you are talking about applying these functions to an RDD or Dataset.
Nothing comes back to the driver in either case; the function you pass is executed on the executors. foreach is an action that returns nothing, whereas map() is a transformation from one value to another.
def foreach(f: (T) ⇒ Unit): Unit
Applies a function f to all elements of this RDD.
foreach(func) Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
def map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]
Return a new RDD by applying a function to all elements of this RDD.
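To make the distinction concrete, here is a minimal sketch, assuming Spark 2.x run locally (the app name, accumulator name, and sample data are made up for illustration). The function passed to map and the one passed to foreach both run on the executors; only take and the accumulator's value are read back on the driver:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("foreach-vs-map-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 1000)

// map is a transformation: it returns a new RDD and is evaluated lazily on the executors
val doubled = rdd.map(_ * 2)

// foreach is an action: it runs on the executors and returns Unit,
// so it is typically used for side effects such as updating an accumulator
val evenCount = sc.longAccumulator("evenCount")
rdd.foreach(x => if (x % 2 == 0) evenCount.add(1))

println(evenCount.value)          // the accumulator value is read on the driver
println(doubled.take(5).toList)   // only take(5) moves data back to the driver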
I'm a beginner to Apache Spark. Spark's RDD API offers transformation functions like map and mapPartitions. I can understand that map works with each element of the RDD, while mapPartitions works with each partition. Many people have mentioned that mapPartitions is ideally used where we want to do object creation/instantiation, and they provide examples like:
val rddData = sc.textFile("sample.txt")
val res = rddData.mapPartitions { iterator =>
  // Do the object instantiation here
  // Use that instantiated object when applying the business logic
  iterator.map(element => element /* apply the business logic here */)
}
My question is: can we not do that with the map function itself, by doing the object instantiation outside the map function, like:
val rddData = sc.textFile("sample.txt")
val obj = InstantiatingSomeObject
val res = rddData.map { element =>
  // Use the instantiated object 'obj' and do something with the data
  element
}
I could be wrong in my fundamental understanding of map and mapPartitions, so if the question itself is wrong, please correct me.
All objects that you create outside of your lambdas are created on the driver. For each execution of the lambda, they are sent over the network to the specific executor.
When calling map, the lambda is executed once per data element, causing your serialized object to be sent over the network once per data element. When using mapPartitions, this happens only once per partition. However, even when using mapPartitions, it is usually better to create the object inside your lambda. In many cases your object is not serializable at all (a database connection, for example); in that case you have to create the object inside your lambda.
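As a rough sketch of that last point (DbConnection is a hypothetical, non-serializable helper used only for illustration, not a real API), the object is created inside the mapPartitions lambda, once per partition, so it never has to be serialized on the driver:

// Hypothetical non-serializable resource, for illustration only.
class DbConnection {
  def lookup(line: String): String = line.toUpperCase // stand-in for a real query
}

val rddData = sc.textFile("sample.txt")

val res = rddData.mapPartitions { iterator =>
  // Created on the executor, once per partition; never shipped from the driver.
  val conn = new DbConnection
  iterator.map(line => conn.lookup(line))
}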
I am trying to understand shuffle in depth and how Spark uses shuffle managers. Here are some very helpful resources I found:
https://trongkhoanguyenblog.wordpress.com/
https://0x0fff.com/spark-architecture-shuffle/
https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md
Reading them, I understood that there are different shuffle managers. I want to focus on two of them: the hash manager and the sort manager (which is the default).
To frame my question, I want to start from a very common transformation:
val reduced = rdd.reduceByKey(_ + _)
This transformation causes map-side aggregation and then a shuffle to bring all identical keys into the same partition.
My questions are:
Is map-side aggregation implemented internally with a mapPartitions transformation, aggregating all identical keys with the combiner function, or is it implemented with an AppendOnlyMap or ExternalAppendOnlyMap?
If AppendOnlyMap or ExternalAppendOnlyMap maps are used for aggregating, are they also used for the reduce-side aggregation that happens in the ResultTask?
What exactly is the purpose of these two kinds of maps (AppendOnlyMap and ExternalAppendOnlyMap)?
Are AppendOnlyMap and ExternalAppendOnlyMap used by all shuffle managers or just by the sort manager?
I read that when an AppendOnlyMap or ExternalAppendOnlyMap is full it is spilled to a file; how exactly does this step happen?
Using the sort shuffle manager, we use an AppendOnlyMap for aggregating and combining partition records, right? Then when execution memory fills up, we start sorting the map, spill it to disk and then clean up the map. My question is: what is the difference between spilling to disk and shuffle write? Both basically consist of creating files on the local file system, but they are treated differently, and shuffle write records are not put into the AppendOnlyMap.
Can you explain in depth what happens when reduceByKey is executed, covering all the steps involved, for example the steps for map-side aggregation, shuffling and so on?
Here is a step-by-step description of reduceByKey:
reduceByKey calls combineByKeyWithClassTag, with an identity createCombiner and with mergeValue and mergeCombiners both equal to the reduce function (a user-level sketch follows after these steps).
combineByKeyWithClassTag creates an Aggregator and returns a ShuffledRDD. Both "map"- and "reduce"-side aggregations use this internal mechanism and don't utilize mapPartitions.
The Aggregator uses an ExternalAppendOnlyMap for both combineValuesByKey ("map-side reduction") and combineCombinersByKey ("reduce-side reduction").
Both methods use the ExternalAppendOnlyMap.insertAll method.
ExternalAppendOnlyMap keeps track of the spilled parts and of the current in-memory map (a SizeTrackingAppendOnlyMap).
The insertAll method updates the in-memory map and checks on each insert whether the estimated size of the current map exceeds the threshold. It uses the inherited Spillable.maybeSpill method. If the threshold is exceeded, this method calls spill as a side effect, and insertAll initializes a clean SizeTrackingAppendOnlyMap.
spill calls spillMemoryIteratorToDisk, which gets a DiskBlockObjectWriter from the block manager.
The insertAll steps are applied for both map- and reduce-side aggregations with the corresponding Aggregator functions, with the shuffle stage in between.
As of Spark 2.0 there is only the sort-based manager: SPARK-14667
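To make the first two steps concrete, here is a rough user-level sketch of what reduceByKey amounts to, expressed through the public combineByKey (which delegates to combineByKeyWithClassTag); the sample data is made up:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey(_ + _) is roughly a combineByKey where createCombiner is the identity
// and mergeValue and mergeCombiners are both the reduce function.
val viaReduce  = pairs.reduceByKey(_ + _)
val viaCombine = pairs.combineByKey(
  (v: Int) => v,                  // createCombiner: identity
  (c: Int, v: Int) => c + v,      // mergeValue: map-side aggregation
  (c1: Int, c2: Int) => c1 + c2   // mergeCombiners: reduce-side aggregation
)

println(viaReduce.collect().sortBy(_._1).toList)   // List((a,4), (b,2))
println(viaCombine.collect().sortBy(_._1).toList)  // same result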
What is the performance difference between the blocks of code below?
1. flatMapToPair: This code block uses a single transformation, but basically has the filter condition inside it; when the condition holds it returns an empty list, so that element in the RDD does not progress any further.
rdd.flatMapToPair(element -> {
    if (<condition>)
        return Lists.newArrayList();
    return Lists.newArrayList(new Tuple2<>(key, element));
})
2. filter + mapToPair: This code block has two transformations: the first simply filters using the same condition as the block above, and then a second transformation, mapToPair, is applied after the filter.
rdd.filter(
    (element) -> <condition>
).mapToPair(
    (element) -> new Tuple2<>(key, element)
)
Is Spark intelligent enough to perform the same work for both of these blocks of code regardless of the number of transformations, or will it perform worse with code block 2 because it uses two transformations?
Thanks
Actually Spark will perform worse in the first case, because it has to initialize and then garbage collect a new ArrayList for each record. Over a large number of records this can add substantial overhead.
Otherwise Spark is "intelligent enough" to use lazy data structures and to combine multiple transformations which don't require shuffles into a single stage.
There are some situations where explicitly merging different transformations is beneficial (either to reduce the number of initialized objects or to keep a shorter lineage), but this is not one of them.
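As a rough Scala sketch of the second variant (the sample data is made up; this is the Scala equivalent of the Java filter + mapToPair above), you can confirm that no extra stage is introduced by checking that toDebugString shows no additional indentation level:

val rdd = sc.parallelize(Seq("a", "bb", "ccc"))

val pairs = rdd
  .filter(_.length > 1)                      // narrow transformation: no shuffle
  .map(element => (element.length, element)) // also narrow, so it is pipelined into the same stage

// Each extra indentation level in the lineage would mark a shuffle boundary; here there is none.
println(pairs.toDebugString)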
I'm coming from a Hadoop background and have limited knowledge about Spark. Based on what I have learned so far, Spark doesn't have mapper/reducer nodes; instead it has driver/worker nodes. The workers are similar to mappers and the driver is (somehow) similar to a reducer. As there is only one driver program, there would be one reducer. If so, how can simple programs like word count on very big data sets get done in Spark? The driver could simply run out of memory.
The driver is more of a controller of the work, only pulling data back if the operator calls for it. If the operator you're working on returns an RDD/DataFrame/Unit, then the data remains distributed. If it returns a native type then it will indeed pull all of the data back.
Otherwise, the concepts of map and reduce are a bit obsolete here (from a type-of-work perspective). The only thing that really matters is whether the operation requires a data shuffle or not. You can see the shuffle points from the stage splits, either in the UI or via toDebugString (where each indentation level is a shuffle).
All that being said, for a vague understanding, you can equate anything that requires a shuffle to a reducer. Otherwise it's a mapper.
Last, to equate to your word count example:
sc.textFile(path)
.flatMap(_.split(" "))
.map((_, 1))
.reduceByKey(_+_)
The above will be done in one stage, as the data loading (textFile), splitting (flatMap), and mapping (map) can all be done independently of the rest of the data. No shuffle is needed until reduceByKey is called, as it needs to combine all of the data to perform the operation. HOWEVER, this operation has to be associative for a reason: each node will perform the operation defined in reduceByKey locally, only merging the resulting data sets afterwards. This reduces both memory and network overhead.
NOTE that reduceByKey returns an RDD and is thus a transformation, so the data is shuffled via a HashPartitioner. All of the data is NOT pulled back to the driver; it merely moves to the nodes that hold the same keys so that the final values can be merged.
Now, if you use an action such as reduce or, worse yet, collect, then you will NOT get an RDD back, which means the data is pulled back to the driver and you will need room for it there.
Here is my fuller explanation of reduceByKey if you want more, or of how this breaks down in something like combineByKey.
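A minimal sketch of that distinction, reusing the word-count pipeline above (path and the variable names are illustrative):

val counts = sc.textFile(path)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)                           // transformation: the (word, count) pairs stay distributed

val onDriver = counts.collect()                 // action: pulls every pair back to the driver
val total    = counts.map(_._2).reduce(_ + _)   // action: returns a single Int to the driver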
I have this function in the driver program which collects the results from the RDDs (in a DStream) into an array and returns it. However, even though the RDDs have data, the function is returning an empty array... What am I doing wrong?
def runTopFunction(): Array[(String, Int)] = {
    val topSearches = some function....
    val summary = new ArrayBuffer[(String, Int)]()

    topSearches.foreachRDD(rdd => {
        summary ++= rdd.collect()
    })

    return summary.toArray
}
So while foreachRDD will do what you are looking to do, it is also non-blocking, which means it won't wait until all of the stream is processed. Since you call toArray on your buffer right after the call to foreachRDD, no elements will have been processed yet.
DStream.foreachRDD is an action on the given DStream and will be scheduled for execution on each streaming batch interval. It is a declarative construction of the job to be executed later on.
Accumulating values in this way is not supported, because while DStream.foreachRDD is just saying "do this on each iteration", the surrounding accumulation code is executed immediately, resulting in an empty array.
Depending on what happens to the summary data after it is calculated, there are a few options for how to implement this:
If the data needs to be retrieved by another process, use a shared thread-safe structure. A priority queue is great for top-k uses.
If the data will be stored (fs, db), you can just write to the storage after applying the topSearches function to the DStream.
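For the second option, a minimal sketch of writing each batch out instead of accumulating into a driver-side array (the output path is illustrative, and topSearches stands in for whatever DStream[(String, Int)] the original function computes):

import org.apache.spark.streaming.dstream.DStream

def saveTopSearches(topSearches: DStream[(String, Int)]): Unit = {
  // Persist each batch as it arrives; nothing is collected on the driver.
  topSearches.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      rdd.saveAsTextFile(s"/tmp/top-searches/${System.currentTimeMillis()}")
    }
  }
}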