Spark - behaviour of first() operation - apache-spark

I'm trying to understand the jobs that get created by Spark for simple first() vs collect() operations.
Given the code:
myRDD = spark.sparkContext.parallelize(['A', 'B', 'C'])
def func(d):
    return d + '-foo'
myRDD = myRDD.map(func)
My RDD is split across 16 partitions:
print(myRDD.toDebugString())
(16) PythonRDD[24] at RDD at PythonRDD.scala:48 []
| ParallelCollectionRDD[23] at parallelize at PythonRDD.scala:475 []
If I call:
myRDD.collect()
I get 1 job with 16 tasks created. I assume this is one task per partition.
However, if I call:
myRDD.first()
I get 3 jobs, with 1, 4, and 11 tasks created. Why have 3 jobs been created?
I'm running spark-2.0.1 with a single 16-core executor, provisioned by Mesos.

This is actually pretty smart Spark behaviour. Your map() is a transformation (it is lazily evaluated), while first() and collect() are both actions (terminal operations). Transformations are only applied to the data at the time you call an action.
When you call first(), Spark tries to touch as few partitions as possible. It first runs a job over just one partition. If that yields no results, it launches a second job over 4 times as many partitions. If there are still no results, Spark again takes 4 times as many partitions (5 * 4) and tries once more.
In your case, by this third try only 11 untouched partitions remain (16 - 1 - 4), so the third job runs 11 tasks. With more data in the RDD, or fewer partitions, Spark would probably find the first() result sooner.
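Below is a minimal sketch (plain Python, not the actual Spark source) of the driver-side loop behind first()/take(n), assuming the default scale-up factor of 4 used by RDD.take. It reproduces the 1, 4, 11 task counts from the question, assuming the three elements happened to land outside the first five partitions:
# Sketch of the take()/first() scan loop: start with one partition, and if nothing
# has been found yet, quadruple the number of partitions scanned by the next job.
def take_jobs(num_partitions, rows_per_partition, n=1, scale_up_factor=4):
    jobs = []            # number of tasks launched by each job
    collected = 0
    parts_scanned = 0
    while collected < n and parts_scanned < num_partitions:
        if parts_scanned == 0:
            parts_to_try = 1                                 # first job: a single partition
        else:
            parts_to_try = parts_scanned * scale_up_factor   # nothing found yet: quadruple
        parts_to_try = min(parts_to_try, num_partitions - parts_scanned)
        jobs.append(parts_to_try)
        collected += sum(rows_per_partition[parts_scanned:parts_scanned + parts_to_try])
        parts_scanned += parts_to_try
    return jobs

# 16 partitions holding only 3 elements, assumed to sit in later partitions:
rows = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(take_jobs(16, rows))   # [1, 4, 11] -- three jobs, matching the question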

Related

For-loops in PySpark cause increasing dataframe size and a failed job

I have a for loop in my PySpark code. When I test the code on around 5 loops it works fine. But when I run it on my core dataset, which results in 160 loops, my PySpark job (submitted on an EMR cluster) fails. It makes a second attempt before finally failing.
Below is a screenshot of the job runs in the Spark History Server:
The initial run (Attempt ID 1) started at 4:13pm; a second run (Attempt ID 2) was made 4 hours later, after which the job failed. When I open up the jobs, I don't see any failed tasks or stages.
I am guessing it is because of the increasing size of the for loop.
Here is the stderr log of the output: It failed with status 1
Here is my pseudocode:
# Load dataframe
df = spark.read.parquet("s3://path")
df = df.persist(StorageLevel.MEMORY_AND_DISK)  # I will be using this df in the for loop
flist = list(df.select('key').distinct().toPandas()['key'])
output = []
for i in flist:
    df2 = df.filter(col('key') == i)
    # Perform operations on df2 for each key that result in a dataframe df3
    output.append(df3)
final_output = reduce(DataFrame.unionByName, output)
I think the output dataframe grows in size until the job eventually fails.
I am running 9 worker nodes, each with 8 vCores and 50 GB of memory.
Is there a way to write the output dataframe to a checkpoint after a set number of loops, clear the memory, and then continue the loops from where they left off in Spark?
EDIT:
My expected output is like so:
key mean prediction
3172742 0.0448 1
3172742 0.0419 1
3172742 0.0482 1
3172742 0.0471 1
3672767 0.0622 2
3672767 0.0551 2
3672767 0.0406 1
I can't use a groupBy function because I am performing a k-means clustering, which doesn't allow groupBy. So I have to iterate over each key to perform the k-means clustering.
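As a purely illustrative sketch of the checkpointing idea asked about above (the intermediate S3 path, the batch size of 20, and the df3 placeholder are assumptions, not part of the original job), the accumulated union can be materialized every N keys so its lineage and the driver-side bookkeeping stay bounded:
# Sketch: write the accumulated result to S3 every `batch_size` keys instead of
# unioning all 160 per-key results at the end.
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

batch_size = 20
batch = []
for n, i in enumerate(flist, start=1):
    df2 = df.filter(col('key') == i)
    df3 = df2  # placeholder for the per-key k-means result from the pseudocode above
    batch.append(df3)
    if n % batch_size == 0 or n == len(flist):
        (reduce(DataFrame.unionByName, batch)
            .write.mode('append').parquet("s3://path/intermediate"))
        batch = []

final_output = spark.read.parquet("s3://path/intermediate")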

Failure Handling of transformations in Spark

I read all the data into a pyspark dataframe from s3.
I apply a filter transformation on the dataframe and then write the dataframe to S3.
Let's say the dataframe had 10 partitions of 64 MB each.
Now say for partitions 1, 2, and 3 the filter and write were successful and their data was written to S3.
Now let's say the filter errors out for partition 4.
What will happen after this? Will Spark proceed with all the remaining partitions and leave partition 4 out, or will the program terminate after writing only 3 partitions?
The relevant parameter for non-local mode of operation is spark.task.maxFailures.
If you have 32 tasks in a stage and 4 executors, with 7 tasks finished, 4 running, and 21 waiting,
then if one of the 4 running tasks fails more times than spark.task.maxFailures after being re-scheduled,
the job will stop and no more stages will be executed.
The 3 remaining running tasks will complete, but that's it.
A multi-stage job must stop, since a new stage can only start when all tasks of the previous stage have completed.
Transformations are all-or-nothing operations. In your case above, Spark will fail the job with the errors from partition 4.
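For reference, the retry threshold can be changed when the session is created; a minimal sketch, where the value 8 and the app name are only examples (the default in non-local mode is 4):
# Sketch: raising the per-task retry limit via spark.task.maxFailures.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("max-failures-example")           # hypothetical app name
         .config("spark.task.maxFailures", "8")     # example value; default is 4
         .getOrCreate())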

DAG and Spark execution

I am trying to get a better understanding of the Spark internals and I am not sure how to interpret the resulting DAG of a job.
Inspired by the example described at http://dev.sortable.com/spark-repartition/,
I run the following code in the Spark shell to obtain the list of prime numbers from 2 to 2 million.
val n = 2000000
val composite = sc.parallelize(2 to n, 8).map(x => (x, (2 to (n / x)))).flatMap(kv => kv._2.map(_ * kv._1))
val prime = sc.parallelize(2 to n, 8).subtract(composite)
prime.collect()
After executing it, I checked the Spark UI and observed the DAG shown in the figure.
Now my question is: I call the function subtract only once, so why does this operation appear
three times in the DAG?
Also, is there any tutorial that explains a bit how Spark creates these DAGs?
Thanks in advance.
subtract is a transformation which requires a shuffle:
First, both RDDs have to be repartitioned using the same partitioner. The local ("map-side") part of the transformation is what is marked as subtract in stages 0 and 1; at this point both RDDs are converted to (item, null) pairs.
The subtract you see in stage 2 happens after the shuffle, once the RDDs have been combined. This is where items are filtered out.
In general any operation which requires a shuffle will be executed in at least two stages (depending on the number of predecessors) and tasks belonging to each stage will be shown separately.
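The stage boundary is also visible from the RDD's debug string. Here is a PySpark rendering of the same example (a sketch, with a much smaller n so it runs quickly); the shuffle introduced by subtract shows up as a new indentation level:
# Sketch: the ShuffledRDD created by subtract() appears as an extra indentation
# level in toDebugString(), marking the boundary between stages.
n = 1000
composite = (sc.parallelize(range(2, n + 1), 8)
               .flatMap(lambda x: [x * k for k in range(2, n // x + 1)]))
prime = sc.parallelize(range(2, n + 1), 8).subtract(composite)
print(prime.toDebugString())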

How are stages split into tasks in Spark?

Let's assume for the following that only one Spark job is running at every point in time.
What I get so far
Here is what I understand what happens in Spark:
When a SparkContext is created, each worker node starts an executor.
Executors are separate processes (JVMs) that connect back to the driver program. Each executor has the jar of the driver program. Quitting a driver shuts down the executors. Each executor can hold some partitions.
When a job is executed, an execution plan is created according to the lineage graph.
The execution job is split into stages, where each stage contains as many neighbouring (in the lineage graph) transformations and actions as possible, but no shuffles. Thus stages are separated by shuffles.
I understand that
A task is a command sent from the driver to an executor by serializing the Function object.
The executor deserializes (with the driver jar) the command (task) and executes it on a partition.
but
Question(s)
How is a stage split into those tasks?
Specifically:
Are the tasks determined by the transformations and actions, or can multiple transformations/actions be in one task?
Are the tasks determined by the partitions (e.g. one task per stage per partition)?
Are the tasks determined by the nodes (e.g. one task per stage per node)?
What I think (only partial answer, even if right)
In https://0x0fff.com/spark-architecture-shuffle, the shuffle is explained with the image
and I get the impression that the rule is
each stage is split into #number-of-partitions tasks, with no regard for the number of nodes
For my first image I'd say that I'd have 3 map tasks and 3 reduce tasks.
For the image from 0x0fff, I'd say there are 8 map tasks and 3 reduce tasks (assuming that there are only three orange and three dark green files).
Open questions in any case
Is that correct? But even if that is correct, my questions above are not all answered, because it is still open whether multiple operations (e.g. multiple maps) are within one task or are separated into one task per operation.
What others say
What is a task in Spark? How does the Spark worker execute the jar file? and How does the Apache Spark scheduler split files into tasks? are similar, but I did not feel that my question was answered clearly there.
You have a pretty nice outline here. To answer your questions
A separate task does need to be launched for each partition of data for each stage. Consider that each partition will likely reside on distinct physical locations - e.g. blocks in HDFS or directories/volumes for a local file system.
Note that the submission of Stages is driven by the DAG Scheduler. This means that stages that are not interdependent may be submitted to the cluster for execution in parallel: this maximizes the parallelization capability on the cluster. So if operations in our dataflow can happen simultaneously we will expect to see multiple stages launched.
We can see that in action in the following toy example in which we do the following types of operations:
load two datasources
perform some map operation on both of the data sources separately
join them
perform some map and filter operations on the result
save the result
So then how many stages will we end up with?
1 stage each for loading the two datasources in parallel = 2 stages
A third stage representing the join that is dependent on the other two stages
Note: all of the follow-on operations working on the joined data may be performed in the same stage because they must happen sequentially. There is no benefit to launching additional stages because they cannot start work until the prior operations have completed.
Here is that toy program
val sfi = sc.textFile("/data/blah/input").map{ x => val xi = x.toInt; (xi,xi*xi) }
val sp = sc.parallelize{ (0 until 1000).map{ x => (x,x * x+1) }}
val spj = sfi.join(sp)
val sm = spj.mapPartitions{ iter => iter.map{ case (k,(v1,v2)) => (k, v1+v2) }}
val sf = sm.filter{ case (k,v) => v % 10 == 0 }
sf.saveAsTextFile("/data/blah/out")
And here is the DAG of the result
Now: how many tasks? The number of tasks should be equal to
the sum over all stages of (#partitions in that stage)
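For example (the partition counts here are assumptions, not read off the DAG above): if the textFile stage ran with 2 partitions, the parallelize stage with 4, and the post-join stage with 4, the job would execute 2 + 4 + 4 = 10 tasks in total.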
This might help you better understand different pieces:
Stage: a collection of tasks. The same process running against different subsets of data (partitions).
Task: represents a unit of work on a partition of a distributed dataset. So in each stage, number-of-tasks = number-of-partitions, or as you said, "one task per stage per partition".
Each executor runs in one YARN container, and each container resides on one node.
Each stage utilizes multiple executors, and each executor is allocated multiple vcores.
Each vcore can execute exactly one task at a time.
So at any stage, multiple tasks can be executed in parallel: number-of-tasks running = number-of-vcores being used.
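For example, a hypothetical setup with 9 executors and 8 vcores each can run at most 9 * 8 = 72 tasks at the same time, so a stage with 200 partitions would be processed in roughly three waves of tasks.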
If I understand correctly there are 2 ( related ) things that confuse you:
1) What determines the content of a task?
2) What determines the number of tasks to be executed?
Spark's engine "glues" together simple operations on consecutive rdds, for example:
rdd1 = sc.textFile( ... )
rdd2 = rdd1.filter( ... )
rdd3 = rdd2.map( ... )
rdd3RowCount = rdd3.count
so when rdd3 is (lazily) computed, spark will generate a task per partition of rdd1 and each task will execute both the filter and the map per line to result in rdd3.
The number of tasks is determined by the number of partitions. Every RDD has a defined number of partitions. For a source RDD that is read from HDFS ( using sc.textFile( ... ) for example ) the number of partitions is the number of splits generated by the input format. Some operations on RDD(s) can result in an RDD with a different number of partitions:
rdd2 = rdd1.repartition( 1000 ) will result in rdd2 having 1000 partitions ( regardless of how many partitions rdd1 had ).
Another example is joins:
rdd3 = rdd1.join( rdd2 , numPartitions = 1000 ) will result in rdd3 having 1000 partitions ( regardless of partitions number of rdd1 and rdd2 ).
( Most ) operations that change the number of partitions involve a shuffle. When we do, for example:
rdd2 = rdd1.repartition( 1000 )
what actually happens is that the task on each partition of rdd1 needs to produce an end-output that can be read by the following stage, so as to make rdd2 have exactly 1000 partitions ( how do they do it? hash or sort ). Tasks on this side are sometimes referred to as "Map ( side ) tasks".
A task that will later run on rdd2 will act on one partition ( of rdd2! ) and would have to figure out how to read/combine the map-side outputs relevant to that partition. Tasks on this side are sometimes referred to as "Reduce ( side ) tasks".
The 2 questions are related: the number of tasks in a stage is the number of partitions ( common to the consecutive rdds "glued" together ) and the number of partitions of an rdd can change between stages ( by specifying the number of partitions to some shuffle causing operation for example ).
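A quick way to see both points in practice (a PySpark sketch; the sizes and partition counts are arbitrary):
# The task count of each stage follows the partition count, and shuffle operations
# such as repartition() can change that count between stages.
rdd1 = sc.parallelize(range(100000), 8)
print(rdd1.getNumPartitions())     # 8 -> 8 map-side tasks in this stage
rdd2 = rdd1.map(lambda x: (x % 100, x)).repartition(1000)
print(rdd2.getNumPartitions())     # 1000 -> 1000 tasks in the stage after the shuffle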
Once the execution of a stage commences, its tasks can occupy task slots. The number of concurrent task-slots is numExecutors * ExecutorCores. In general, these can be occupied by tasks from different, non-dependent stages.

Spark map is only one task while it should be parallel (PySpark)

I have an RDD with around 7M entries, each holding 10 normalized coordinates. I also have a number of centers, and I'm trying to map every entry to the closest (Euclidean distance) center. The problem is that this only generates one task, which means it is not parallelizing. This is the form:
def doSomething(point, centers):
    for center in centers.value:
        if distance(point, center) < 1:
            return center
    return None

preppedData.map(lambda x: doSomething(x, centers)).take(5)
The preppedData RDD is cached and already evaluated. The doSomething function is shown much simpler here than it actually is, but it's the same principle. centers is a list that has been broadcast. Why does this map run in only one task?
Similar pieces of code in other projects map to roughly 100 tasks and run on all the executors, while this one is 1 task on 1 executor. My job has 8 executors available, with 8 GB and 2 cores per executor.
This could be due to the conservative nature of the take() method.
See the code in RDD.scala.
What it does is first take the first partition of your RDD (if your RDD doesn't require a shuffle, this will require only one task) and if there are enough results in that one partition, it will return that. If there is not enough data in your partition, it will then grow the number of partitions it tries to take until it gets the required number of elements.
Since your RDD is already cached and your operation is only a map function, as long as the first partition of your RDD has at least 5 rows, this will only ever require one task. More tasks would be unnecessary.
This code exists to avoid overloading the driver with too much data by fetching from all partitions at once for a small take.
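If the goal is to exercise every partition (and every executor) rather than to peek at a few rows, an action that touches all partitions can be used instead; a sketch reusing the question's names:
# Sketch: count() launches one task per partition, so the map runs in parallel on
# all executors; the cached result can then be inspected cheaply with take().
mapped = preppedData.map(lambda x: doSomething(x, centers)).cache()
mapped.count()           # one task per partition -> fully parallel
print(mapped.take(5))    # served from the cached result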
