How does Spark manage stages? - apache-spark

I am trying to understand how jobs and stages are defined in spark, and for that I am now using the code that I found here and spark UI. In order to see it on spark UI I had to copy and paste the text on the files several times so it takes more time to process.
Here is the output of spark UI:
Now, I understand that there are three jobs because there are three actions and also that the stages are generated because of shuffle actions, but what I don't understand is why in the Job 1 stages 4, 5 and 6 are the same as stages 0, 1 and 2 of Job 0 and the same happens for Job 2.
How can I know what stages will be in more than a job only seeing the java code (before executing anything)? And also, why are stage 4 and 9 skipped and how can I know it will happen before executing?

I understand that there are three jobs because there are three actions
I'd even say that there could have been more Spark jobs but the minimum number is 3. It all depends on the implementation of transformations and the action used.
I don't understand is why in the Job 1 stages 4, 5 and 6 are the same as stages 0, 1 and 2 of Job 0 and the same happens for Job 2.
Job 1 is the result of some action that ran on a RDD, finalRdd. That RDD was created using (going backwards): join, textFile, map, and distinct.
val people = sc.textFile("people.csv").map { line =>
val tokens = line.split(",")
val key = tokens(2)
(key, (tokens(0), tokens(1))) }.distinct
val cities = sc.textFile("cities.csv").map { line =>
val tokens = line.split(",")
(tokens(0), tokens(1))
}
val finalRdd = people.join(cities)
Run the above and you'll see the same DAG.
Now, when you execute leftOuterJoin or rightOuterJoin actions, you'll get the other two DAGs. You're using the previously-used RDDs to run new Spark jobs and so you'll see the same stages.
why are stage 4 and 9 skipped
Often, Spark will skip execution of some stages. The grayed-out stages are ones already computed so Spark will reuse them and so make performance better.
How can I know what stages will be in more than a job only seeing the java code (before executing anything)?
That's what RDD lineage (graph) offers.
scala> people.leftOuterJoin(cities).toDebugString
res15: String =
(3) MapPartitionsRDD[99] at leftOuterJoin at <console>:28 []
| MapPartitionsRDD[98] at leftOuterJoin at <console>:28 []
| CoGroupedRDD[97] at leftOuterJoin at <console>:28 []
+-(2) MapPartitionsRDD[81] at distinct at <console>:27 []
| | ShuffledRDD[80] at distinct at <console>:27 []
| +-(2) MapPartitionsRDD[79] at distinct at <console>:27 []
| | MapPartitionsRDD[78] at map at <console>:24 []
| | people.csv MapPartitionsRDD[77] at textFile at <console>:24 []
| | people.csv HadoopRDD[76] at textFile at <console>:24 []
+-(3) MapPartitionsRDD[84] at map at <console>:29 []
| cities.csv MapPartitionsRDD[83] at textFile at <console>:29 []
| cities.csv HadoopRDD[82] at textFile at <console>:29 []
As you can see yourself, you will end up with 4 stages since there are 3 shuffle dependencies (the edges with the numbers of partitions).
Numbers in the round brackets are the number of partitions that DAGScheduler will eventually use to create task sets with the exact number of tasks. One TaskSet per stage.

Related

In which situations are the stages of DAG skipped?

I am trying to find the situations in which Spark would skip stages in case I am using RDDs. I know that it will skip stages if there is a shuffle operation happening. So, I wrote the following code to see if it is true:
def main(args: Array[String]): Unit =
{
val conf = new SparkConf().setMaster("local").setAppName("demo")
val sc = new SparkContext(conf)
val d = sc.parallelize(0 until 1000000).map(i => (i%100000, i))
val c=d.rightOuterJoin(d.reduceByKey(_+_)).collect
val f=d.leftOuterJoin(d.reduceByKey(_+_)).collect
val g=d.join(d.reduceByKey(_ + _)).collect
}
On inspecting Spark UI, I am getting the following jobs with its stages:
I was expecting stage 3 and stage 6 to be skipped as these were using the same RDD to compute the required joins(given the fact that in case of shuffle, spark automatically caches data). Can anyone please explain why am I not seeing any skipped stages here? And how can I modify the code to see the skipped stages? And are there any other situations(apart from shuffling) when Spark is expected to skip stages?
Actually, it is very simple.
In your case nothing can be skipped as each Action has a different JOIN type. It needs to scan d and d' to compute the result. Even with .cache (which you do not use and should use to avoid recomputing all the way back to source on each Action), this would make no difference.
Looking at this simplified version:
val d = sc.parallelize(0 until 100000).map(i => (i%10000, i)).cache // or not cached, does not matter
val c=d.rightOuterJoin(d.reduceByKey(_+_))
val f=d.leftOuterJoin(d.reduceByKey(_+_))
c.count
c.collect // skipped, shuffled
f.count
f.collect // skipped, shuffled
Shows the following Jobs for this App:
(4) Spark Jobs
Job 116 View(Stages: 3/3)
Job 117 View(Stages: 1/1, 2 skipped)
Job 118 View(Stages: 3/3)
Job 119 View(Stages: 1/1, 2 skipped)
You can see that successive Actions based on same shuffling result cause a skipping of one or more Stages for the second Action / Job for val c or val f. That is to say, the join type for c and f are known and the 2 Actions for the same join type run sequentially profiting from prior work, i.e. the second Action can rely on the shuffling of the first Action that is directly applicable to the 2nd Action. That simple.

Spark driver memory for rdd.saveAsNewAPIHadoopFile and workarounds

I'm having issues with a particular spark method, saveAsNewAPIHadoopFile. The context is that I'm using pyspark, indexing RDDs with 1k, 10k, 50k, 500k, 1m records into ElasticSearch (ES).
For a variety of reasons, the Spark context is quite underpowered with a 2gb driver, and single 2gb executor.
I've had no problem until about 500k, when I'm getting java heap size problems. Increasing the spark.driver.memory to about 4gb, and I'm able to index more. However, there is a limit to how long this will work, and we would like to index in upwards of 500k, 1m, 5m, 20m records.
Also constrained to using pyspark, for a variety of reasons. The bottleneck and breakpoint seems to be a spark stage called take at SerDeUtil.scala:233, that no matter how many partitions the RDD has going in, it drops down to one, which I'm assuming is the driver collecting the partitions and preparing for indexing.
Now - I'm wondering if there is an efficient way to still use an approach like the following, given that constraint:
to_index_rdd.saveAsNewAPIHadoopFile(
path='-',
outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
keyClass="org.apache.hadoop.io.NullWritable",
valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
conf={
"es.resource":"%s/record" % index_name,
"es.nodes":"192.168.45.10:9200",
"es.mapping.exclude":"temp_id",
"es.mapping.id":"temp_id",
}
)
In pursuit of a good solution, I might as well air some dirty laundry. I've got a terribly inefficient workaround that uses zipWithIndex to chunk an RDD, and send those subsets to the indexing function above. Looks a bit like this:
def index_chunks_to_es(spark=None, job=None, kwargs=None, rdd=None, chunk_size_limit=10000):
# zip with index
zrdd = rdd.zipWithIndex()
# get count
job.update_record_count(save=False)
count = job.record_count
# determine number of chunks
steps = count / chunk_size_limit
if steps % 1 != 0:
steps = int(steps) + 1
# evenly distribute chunks, while not exceeding chunk_limit
dist_chunk_size = int(count / steps) + 1
# loop through steps, appending subset to list for return
for step in range(0, steps):
# determine bounds
lower_bound = step * dist_chunk_size
upper_bound = (step + 1) * dist_chunk_size
print(lower_bound, upper_bound)
# select subset
rdd_subset = zrdd.filter(lambda x: x[1] >= lower_bound and x[1] < upper_bound).map(lambda x: x[0])
# index to ElasticSearch
ESIndex.index_job_to_es_spark(
spark,
job=job,
records_df=rdd_subset.toDF(),
index_mapper=kwargs['index_mapper']
)
It's slow, if I'm understanding correctly, because that zipWithIndex, filter, and map are evaluated for each resulting RDD subset. However, it's also memory efficient in that 500k, 1m, 5m, etc. records are never sent to saveAsNewAPIHadoopFile, instead, these smaller RDDs that a relatively small spark driver can handle.
Any suggestions for different approaches would be greatly appreciated. Perhaps that means now using the Elasticsearch-Hadoop connector, but instead sending raw JSON to ES?
Update:
Looks like I'm still getting java heap space errors with this workaround, but leaving here to demonstrate thinking for a possible workaround. Had not anticipated that zipWithIndex would collect everything on the driver (which I'm assuming is the case here)
Update #2
Here is a debug string of the RDD I'ma attempting to run through saveAsNewAPIHadoopFile:
(32) PythonRDD[6] at RDD at PythonRDD.scala:48 []
| MapPartitionsRDD[5] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[4] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| ShuffledRowRDD[3] at javaToPython at NativeMethodAccessorImpl.java:-2 []
+-(1) MapPartitionsRDD[2] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[1] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| JDBCRDD[0] at javaToPython at NativeMethodAccessorImpl.java:-2 []
Update #3
Below is a DAG visualization for the take at SerDeUtil.scala:233 that appears to run on driver/localhost:
And a DAG for the saveAsNewAPIHadoopFile for a much smaller job (around 1k rows), as the 500k rows attempts never actually fire as the SerDeUtil stage above is what appears to trigger the java heap size problem for larger RDDs:
I'm still a bit confused as to why this addresses the problem, but it does. When reading rows from MySQL with spark.jdbc.read, by passing bounds, the resulting RDD appears to be partitioned in such a way that saveAsNewAPIHadoopFile is successful for large RDDs.
Have a Django model for the DB rows, so can get first and last column IDs:
records = records.order_by('id')
start_id = records.first().id
end_id = records.last().id
Then, pass those to spark.read.jdbc:
sqldf = spark.read.jdbc(
settings.COMBINE_DATABASE['jdbc_url'],
'core_record',
properties=settings.COMBINE_DATABASE,
column='id',
lowerBound=bounds['lowerBound'],
upperBound=bounds['upperBound'],
numPartitions=settings.SPARK_REPARTITION
)
The debug string for the RDD shows that the originating RDD now has 10 partitions:
(32) PythonRDD[11] at RDD at PythonRDD.scala:48 []
| MapPartitionsRDD[10] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[9] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| ShuffledRowRDD[8] at javaToPython at NativeMethodAccessorImpl.java:-2 []
+-(10) MapPartitionsRDD[7] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[6] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| JDBCRDD[5] at javaToPython at NativeMethodAccessorImpl.java:-2 []
Where my understanding breaks down, is that you can see there is a manual/explicit repartitioning to 32, both in the debug string from the question, and this one above, which I thought would be enough to ease memory pressure on the saveAsNewAPIHadoopFile call, but apparently the Dataframe (turned into an RDD) from the original spark.jdbc.read matters even downstream.

How does lineage get passed down in RDDs in Apache Spark

Do each RDD point to the same lineage graph
or
when a parent RDD gives its lineage to a new RDD, is the lineage graph copied by the child as well so both the parent and child have different graphs. In this case isn't it memory intensive?
Each RDD maintains a pointer to one or more parent along with the metadata about what type of relationship it has with the parent. For example, when we call val b = a.map() on an RDD, the RDD b just keeps a reference (and never copies) to its parent a, that's a lineage.
And when the driver submits the job, the RDD graph is serialized to the worker nodes so that each of the worker nodes apply the series of transformations (like, map filter and etc..) on different partitions. Also, this RDD lineage will be used to recompute the data if some failure occurs.
To display the lineage of an RDD, Spark provides a debug method toDebugString() method.
Consider the following example:
val input = sc.textFile("log.txt")
val splitedLines = input.map(line => line.split(" "))
.map(words => (words(0), 1))
.reduceByKey{(a,b) => a + b}
Executing toDebugString() on splitedLines RDD, will output the following,
(2) ShuffledRDD[6] at reduceByKey at <console>:25 []
+-(2) MapPartitionsRDD[5] at map at <console>:24 []
| MapPartitionsRDD[4] at map at <console>:23 []
| log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
| log.txt HadoopRDD[0] at textFile at <console>:21 []
For more information about how Spark works internally, please read my another post
When a transformation(map or filter etc) is called, it is not executed by Spark immediately, instead a lineage is created for each transformation.
A lineage will keep track of what all transformations has to be applied on that RDD,
including the location from where it has to read the data.
For example, consider the following example
val myRdd = sc.textFile("spam.txt")
val filteredRdd = myRdd.filter(line => line.contains("wonder"))
filteredRdd.count()
sc.textFile() and myRdd.filter() do not get executed immediately,
it will be executed only when an Action is called on the RDD - here filteredRdd.count().
An Action is used to either save result to some location or to display it.
RDD lineage information can also be printed by using the command filteredRdd.toDebugString(filteredRdd is the RDD here).
Also, DAG Visualization shows the complete graph in a very intuitive manner as follows:

Spark order of execution of tasks inside of a stage and order of execution of stages

I'm newbie in spark I would like to understand some basic mechanism on how it works behind the scenes.
I have attached the lineage of my RDD and I have the following questions:
Why do I have 8 stages instead of 5? From the book Learning from Spark (Chapter 8 http://bit.ly/1E0Hah7), I could understand that "RDDs that exist at the same level of indentation as their
parents will be pipelined [into same physical stage] during physical execution". Since I have 5 parents, I'm expected to have 5 stages. Still the Spark UI stages view, shows 8 stages.
Also what represents the (8) represented in the debug string? Is any bug in this function?
At the stage level, what is the execution order among the tasks?
They can be executed all of them in parallel
HadoopRDD[0] || MappedRDD[1] || MapPartitionsRDD[4] || ZippedWithIndexRDD[6]
or they are waiting each task upon the other to complete
HadoopRDD[0]=>completed=>MappedRDD[1]=>completed=>etc ?
Between stages, the order is given by the execution plan, so each stage is waiting till the ones before are completed. Is this a correct assumption?
I look forward for your answers.
Regards,
Florin
(8) MappedRDD[21] at map at WAChunkSepvgFilterNewModel.scala:298 []
| MappedRDD[20] at map at WAChunkSepvgFilterNewModel.scala:182 []
| ShuffledRDD[19] at sortByKey at WAChunkSepvgFilterNewModel.scala:182 []
+-(8) ShuffledRDD[16] at aggregateByKey at WAChunkSepvgFilterNewModel.scala:182 []
+-(8) FlatMappedRDD[15] at flatMap at WAChunkSepvgFilterNewModel.scala:174 []
| ZippedWithIndexRDD[14] at zipWithIndex at WAChunkSepvgFilterNewModel.scala:174 []
| MappedRDD[13] at map at WAChunkSepvgFilterNewModel.scala:272 []
| MappedRDD[12] at map at WAChunkSepvgFilterNewModel.scala:161 []
| ShuffledRDD[11] at sortByKey at WAChunkSepvgFilterNewModel.scala:161 []
+-(8) ShuffledRDD[8] at aggregateByKey at WAChunkSepvgFilterNewModel.scala:161 []
+-(8) FlatMappedRDD[7] at flatMap at WAChunkSepvgFilterNewModel.scala:153 []
| ZippedWithIndexRDD[6] at zipWithIndex at WAChunkSepvgFilterNewModel.scala:153 []
| MappedRDD[5] at map at WAChunkSepvgFilterNewModel.scala:248 []
| MapPartitionsRDD[4] at mapPartitionsWithIndex at WAChunkSepvgFilterNewModel.scala:114 []
| test4spark.csv MappedRDD[1] at textFile at WAChunkSepvgFilterNewModel.scala:215 []
| test4spark.csv HadoopRDD[0] at textFile at WAChunkSepvgFilterNewModel.scala:215 []
(8) Represents the # of partitions in the rdd.
Regarding Point 3,your assumption is correct.Once all the tasks belonging to a stage is completed only then the next stages tasks are started

Apache spark applying map transformation on RDDs

I have a HadoopRDD from which I'm creating a first RDD with a simple Map function then a second RDD from the first RDD with another simple Map function. Something like :
HadoopRDD -> RDD1 -> RDD2.
My question is whether Spak will iterate over the HadoopRDD record by record to generate RDD1 then it will iterate over RDD1 record by record to generate RDD2 or does it ietrate over HadoopRDD and then generate RDD1 and then RDD2 in one go.
Short answer: rdd.map(f).map(g) will be executed in one pass.
tl;dr
Spark splits a job into stages. A stage applied to a partition of data is a task.
In a stage, Spark will try to pipeline as many operations as possible. "Possible" is determined by the need to rearrange data: an operation that requires a shuffle will typically break the pipeline and create a new stage.
In practical terms:
Given `rdd.map(...).map(..).filter(...).sort(...).map(...)`
will result in two stages:
.map(...).map(..).filter(...)
.sort(...).map(...)
This can be retrieved from an rdd using rdd.toDebugString
The same job example above will produce this output:
val mapped = rdd.map(identity).map(identity).filter(_>0).sortBy(x=>x).map(identity)
scala> mapped.toDebugString
res0: String =
(6) MappedRDD[9] at map at <console>:14 []
| MappedRDD[8] at sortBy at <console>:14 []
| ShuffledRDD[7] at sortBy at <console>:14 []
+-(8) MappedRDD[4] at sortBy at <console>:14 []
| FilteredRDD[3] at filter at <console>:14 []
| MappedRDD[2] at map at <console>:14 []
| MappedRDD[1] at map at <console>:14 []
| ParallelCollectionRDD[0] at parallelize at <console>:12 []
Now, coming to the key point of your question: pipelining is very efficient. The complete pipeline will be applied to each element of each partition once. This means that rdd.map(f).map(g) will perform as fast as rdd.map(f andThen g) (with some neglectable overhead)
Apache Spark will iterate over the HadoopRDD record by record in no specific order (data will be split and sent to the workers) and "apply" the first transformation to compute RDD1. After that, the second transformation is applied to each element of RDD1 to get RDD2, again in no specific order, and so on for successive transformations. You can notice it from the map method signature:
// Return a new RDD by applying a function to all elements of this RDD.
def map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]
Apache Spark follows a DAG (Directed Acyclic Graph) execution engine. It won't actually trigger any computation until a value is required, so you have to distinguish between transformations and actions.
EDIT:
In terms of performance, I am not completely aware of the underlying implementation of Spark, but I understand there shouldn't be a significant performance loss other than adding extra (unnecessary) tasks in the related stage. From my experience, you don't normally use transformations of the same "nature" successively (in this case two successive map's). You should be more concerned of performance when shuffling operations take place, because you are moving data around and this has a clear impact on your job performance. Here you can find a common issue regarding that.

Resources