I upgraded Spark from version 2.4 to 3.0.1 and I'm facing problems with execution time: it is 40 times slower than before.
The problem seems to be in the following part:
df = spark.createDataFrame(df_result.rdd.map(parse_multiple_rows), schema=schema)
where the function converts each row into a Python dict and makes some transformations:
def parse_multiple_rows(arr):
    aux = arr.asDict()
    ...
    return aux
In the DAG, an extra SQLExecutionRDD javaToPython step appears as a difference between the two versions. Another difference is that Spark 2 puts the stages into a single job while Spark 3 divides them into four jobs, but the stages are the same in the end.
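To narrow down where the extra time goes, one option is to time the Python map and the DataFrame conversion separately. This is only a measurement sketch; the timing wrapper is mine, while df_result, parse_multiple_rows, schema and spark are the names used above.

import time

mapped = df_result.rdd.map(parse_multiple_rows).cache()

t0 = time.time()
mapped.count()  # forces the DataFrame -> RDD -> Python round trip plus the map
print("rdd.map took", time.time() - t0, "s")

t0 = time.time()
spark.createDataFrame(mapped, schema=schema).count()  # Python -> JVM conversion and schema validation
print("createDataFrame took", time.time() - t0, "s")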
I'm trying to migrate my Spark application from Spark 2.x to 3.x, but something weird is happening.
In my application there is a job with multiple joins (maybe 40-50 dataframes joined with the same base dataframe). Everything is OK on Spark 2.x, while on Spark 3.x no DAG is generated and there are no error logs either; the application seems to hang and I have no idea why.
I tried to force-split those joins into multiple jobs, and things turn out OK when each job has 5 joins.
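For reference, a minimal PySpark sketch of that batching workaround, assuming hypothetical names base_df and dfs_to_join and a placeholder join key "id"; materializing a checkpoint every few joins keeps the logical plan small so the planner never has to analyze all 40-50 joins at once.

spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  # assumed location

result = base_df
for i, other in enumerate(dfs_to_join, start=1):
    result = result.join(other, on="id", how="left")  # "id" and "left" are placeholders
    if i % 5 == 0:
        # Materialize after every 5 joins so each batch becomes its own job
        # and the accumulated plan is truncated.
        result = result.checkpoint(eager=True)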
I have a for loop in my PySpark code. When I test the code with around 5 iterations it works fine, but when I run it on my core dataset, which results in 160 iterations, my PySpark job (submitted on an EMR cluster) fails. It attempts a second run before failing.
Below is a screenshot of the job runs in the Spark History Server:
The initial run, Attempt ID 1, started at 4:13 pm, and 4 hours later a second run, Attempt ID 2, was made, after which the job failed. When I open up the jobs, I don't see any failed tasks or stages.
I am guessing it is because of the increasing size of the for loop.
Here is the stderr log of the output: It failed with status 1
Here is my pseudocode:
from functools import reduce
from pyspark import StorageLevel
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

# Load dataframe
df = spark.read.parquet("s3://path")
df = df.persist(StorageLevel.MEMORY_AND_DISK)  # I will be using this df in the for loop

flist = list(df.select('key').distinct().toPandas()['key'])

output = []
for i in flist:
    df2 = df.filter(col('key') == i)
    # Perform operations on df2 for each key that result in a dataframe df3
    output.append(df3)

final_output = reduce(DataFrame.unionByName, output)
I think the output dataframe grows in size to the point where the job eventually fails.
I am running 9 worker nodes, each with 8 vCores and 50 GB of memory.
Is there a way in Spark to checkpoint the output dataframe after a set number of iterations, clear the memory, and then continue the loop from where it left off?
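One approach, sketched under the assumption that each batch of results can be written out independently: write the accumulated output to storage every N iterations, drop it from memory, and read all the batches back at the end. The batch size and output path below are placeholders.

from functools import reduce
from pyspark.sql import DataFrame

BATCH_SIZE = 20  # placeholder: tune to what the cluster tolerates
batch, batch_id = [], 0

for i in flist:
    df2 = df.filter(col('key') == i)
    # ... operations on df2 producing df3, as in the pseudocode above ...
    batch.append(df3)
    if len(batch) == BATCH_SIZE:
        # Persist this batch and drop the accumulated plans from driver memory.
        reduce(DataFrame.unionByName, batch).write.mode("overwrite") \
            .parquet(f"s3://path/output/batch={batch_id}")  # placeholder path
        batch, batch_id = [], batch_id + 1

if batch:  # flush the final partial batch
    reduce(DataFrame.unionByName, batch).write.mode("overwrite") \
        .parquet(f"s3://path/output/batch={batch_id}")

final_output = spark.read.parquet("s3://path/output")  # all batches, read back together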
EDIT:
My expected output is like so:
key mean prediction
3172742 0.0448 1
3172742 0.0419 1
3172742 0.0482 1
3172742 0.0471 1
3672767 0.0622 2
3672767 0.0551 2
3672767 0.0406 1
I can't use the groupBy function because I am performing k-means clustering and it doesn't allow groupBy. So I have to iterate over each key to perform the k-means clustering.
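If the cluster runs Spark 3.0+ with pandas and scikit-learn available on the workers, a grouped pandas UDF can fit one k-means model per key without a driver-side loop. This is not the original approach, only an alternative sketch: the feature column, number of clusters and output schema are guesses based on the expected output above.

import pandas as pd
from sklearn.cluster import KMeans

result_schema = "key long, mean double, prediction int"  # assumed to match the expected output

def kmeans_per_key(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows for a single key; fit k-means on the (assumed) feature column.
    model = KMeans(n_clusters=2, n_init=10).fit(pdf[["mean"]])
    return pdf.assign(prediction=model.labels_.astype("int32"))[["key", "mean", "prediction"]]

final_output = df.groupBy("key").applyInPandas(kmeans_per_key, schema=result_schema)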
I am trying to get a better understanding of the Spark internals and I am not sure how to interpret the resulting DAG of a job.
Inspired by the example described at http://dev.sortable.com/spark-repartition/,
I ran the following code in the Spark shell to obtain the list of prime numbers from 2 to 2 million.
val n = 2000000
val composite = sc.parallelize(2 to n, 8).map(x => (x, (2 to (n / x)))).flatMap(kv => kv._2.map(_ * kv._1))
val prime = sc.parallelize(2 to n, 8).subtract(composite)
prime.collect()
After executing it, I checked the Spark UI and observed the DAG shown in the figure.
Now my question is: I call the function subtract only once, so why does this operation appear
three times in the DAG?
Also, is there any tutorial that explains a bit how Spark creates these DAGs?
Thanks in advance.
subtract is a transformation which requires a shuffle:
First, both RDDs have to be repartitioned using the same partitioner. The local ("map-side") part of the transformation is marked as subtract in stages 0 and 1. At this point both RDDs are converted to (item, null) pairs.
The subtract you see in stage 2 happens after the shuffle, when the RDDs have been combined. This is where the items are actually filtered.
In general, any operation which requires a shuffle will be executed in at least two stages (depending on the number of predecessors), and the tasks belonging to each stage will be shown separately.
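A conceptual PySpark sketch of what that means (not Spark's actual implementation; rdd_a and rdd_b stand for the two RDDs being subtracted): subtract behaves roughly like mapping both RDDs to (item, null) pairs, cogrouping them with a common partitioner, and keeping only the keys that never appear on the right side.

a_pairs = rdd_a.map(lambda x: (x, None))          # "map-side" subtract for the first RDD
b_pairs = rdd_b.map(lambda x: (x, None))          # "map-side" subtract for the second RDD
result = (a_pairs.cogroup(b_pairs)                # shuffle: both sides use the same partitioner
          .filter(lambda kv: not list(kv[1][1]))  # keep keys with no match in rdd_b
          .keys())                                # post-shuffle subtract: the surviving items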
I'm trying to understand the jobs that get created by Spark for simple first() vs collect() operations.
Given the code:
myRDD = spark.sparkContext.parallelize(['A', 'B', 'C'])
def func(d):
    return d + '-foo'
myRDD = myRDD.map(func)
My RDD is split across 16 partitions:
print(myRDD.toDebugString())
(16) PythonRDD[24] at RDD at PythonRDD.scala:48 []
| ParallelCollectionRDD[23] at parallelize at PythonRDD.scala:475 []
If I call:
myRDD.collect()
I get 1 job with 16 tasks created. I assume this is one task per partition.
However, if I call:
myRDD.first()
I get 3 jobs, with 1, 4, and 11 tasks created. Why have 3 jobs been created?
I'm running spark-2.0.1 with a single 16-core executor, provisioned by Mesos.
This is actually pretty smart Spark behaviour. Your map() is a transformation (it is lazily evaluated) and both first() and collect() are actions (terminal operations). All transformations are applied to the data at the time you call an action.
When you call first(), Spark tries to perform as few operations (transformations) as possible. First, it tries a single partition. If there is no result, it takes 4 times more partitions and computes again. If there is still no result, Spark takes 4 times more partitions again (5 * 4) and once more tries to get any result.
In your case, by the third try you have only 11 untouched partitions left (16 - 1 - 4). If you had more data in the RDD or fewer partitions, Spark would probably find the first() result sooner.
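To make the 1 / 4 / 11 task counts concrete, here is a rough, non-Spark sketch of that escalation (the real logic lives in RDD.take; the factor of 4 is Spark's default scale-up factor, and in reality the scan stops as soon as enough elements are found):

def partitions_per_job(total_partitions, scale_up_factor=4):
    jobs, scanned, to_scan = [], 0, 1
    while scanned < total_partitions:
        to_scan = min(to_scan, total_partitions - scanned)
        jobs.append(to_scan)                 # one Spark job scanning this many partitions
        scanned += to_scan
        to_scan = scanned * scale_up_factor  # the next job scans ~4x what has been tried so far
    return jobs

print(partitions_per_job(16))  # [1, 4, 11] -- matching the three jobs observed above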
I have a large list of edges as a 5000 partition RDD. Now, I'm doing a simple but
shuffle-heavy operation:
val g = Graph.fromEdges(edges, ...).partitionBy(...)
val subs = Graph(g.collectEdges(...), g.edges).collectNeighbors()
subs.saveAsObjectFile("hdfs://...")
The job gets divided into 9 stages (5000 tasks each). My cluster has 3 workers in the same local network.
Even though Spark 1.5.0 works much faster and the first several stages run at full load, starting from one of the stages (mapPartitions at GraphImpl.scala:235) a single machine suddenly takes 99% of the tasks, while the others take only as many tasks as they have cores, and those tasks stay RUNNING until the one machine that actually works finishes everything. Interestingly, on Spark 1.3.1 all stages get their tasks distributed evenly among the cluster machines. I suspect this could be a bug in 1.5.0.
UPD: It seems the problem is not data-related: I randomly generated a highly homogeneous graph (each vertex has degree 5) and observed identical behaviour. So this is either a strange hardware problem or a Tungsten-related issue. Still no exact answer.
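For reference, the "not data related" point can also be spot-checked by comparing per-partition sizes of the edge list; a PySpark sketch of the idea (the original job is Scala/GraphX, and edges here is a stand-in for the edge RDD):

sizes = edges.glom().map(len).collect()  # element count of each of the 5000 partitions
print(min(sizes), max(sizes), sum(sizes) / len(sizes))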