Spark - Optimizing a long-running job - apache-spark

We have a Spark job that's taking a long time to complete. Looking at the Spark Web UI I see a lot of shuffling. I tried a couple of things but have had no luck so far: I increased spark.sql.shuffle.partitions (tried 320, 640 and 1600), the number of executors (8), the memory (10/12 GB) and used 4 cores, but saw no significant improvement. I'd appreciate any guidance on the points below:
1) In the event timeline in the Spark Web UI, only one executor is doing most of the processing; I don't see any significant activity on the rest.
2) In the metrics, there is a large difference in shuffle spill between the 75th percentile and the max.
Any pointers on how to investigate further would be a great help! Basically I'm looking for documentation on the event timeline, since a single executor is performing the bulk of the work, and on how to use the metrics to fix the performance issue by adjusting the Spark configuration parameters, if that's an option.
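
One common cause of a single busy executor is key skew rather than the partition count: if one key holds most of the rows, it lands in one shuffle partition no matter how many partitions you configure. Below is a minimal sketch for checking the key distribution, assuming a Spark SQL job; the table and join_key column names are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val spark = SparkSession.builder().appName("skew-check").getOrCreate()

spark.table("my_table")            // the table feeding the shuffle-heavy stage
  .groupBy("join_key")             // the key used in the join / aggregation
  .count()
  .orderBy(desc("count"))
  .show(20)                        // a handful of very large counts confirms skew

// Raising the partition count only helps if the rows are spread across many keys:
spark.conf.set("spark.sql.shuffle.partitions", "640")

If the top few keys dominate, salting those keys or handling them in a separate query tends to help more than further configuration changes.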

Related

High shuffle write and high execution time

I'm using EMR and I'm developing a collaborative-filtering approach with ALS. I have three questions:
To compare execution times in the Spark UI, I ran several experiments. I've noticed that with one master and 4 workers the execution time is lower than on the EMR cluster with one master and six workers. Does anyone know why?
The other thing is shuffle write. With one master and six workers I have 3.2 GB. That is too high, isn't it? In the code I use RDDs, groupByKey and two joins. How can I minimize it?
With one master and six workers, the execution time is 7.5 min. Given that I'm using the MovieLens dataset with a machine-learning approach, I can't tell whether this execution time is too high or quite good.
I attach a picture with the results from the Spark UI.
Thank you in advance.
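
One thing that typically drives shuffle write up with RDDs is groupByKey followed by an aggregation, since every value is shipped across the network before anything is combined. A minimal sketch of swapping it for reduceByKey, which pre-aggregates inside each partition; the (movieId, rating) pairs are purely illustrative:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shuffle-demo"))

// Illustrative data: (movieId, rating) pairs.
val ratings = sc.parallelize(Seq((1, 4.0), (1, 5.0), (2, 3.0)))

// groupByKey shuffles every value, then aggregates:
val sumsWithGroup = ratings.groupByKey().mapValues(_.sum)

// reduceByKey combines values inside each partition first, so far less data is shuffle-written:
val sums = ratings.reduceByKey(_ + _)

The same idea applies to the joins: joining already-aggregated (and therefore smaller) RDDs shuffles less data than joining the raw ones.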

Processing pipeline using Spark SQL - jobs, stages and DAG sizes

I have a processing pipeline built with Spark SQL. The objective is to read data from Hive in the first step and apply a series of functional operations (using Spark SQL) to produce the functional output. These operations are quite numerous (more than 100), which means I am running around 50 to 60 Spark SQL queries in a single pipeline. While the application completes successfully without any issues, my focus has shifted to optimizing the overall process. I have been able to speed up execution by tuning spark.sql.shuffle.partitions, changing the executor memory and reducing spark.memory.fraction from the default 0.6 to 0.2. These changes gave great benefits, and the overall execution time dropped from 20-25 minutes to around 10 minutes. Data volume is around 100k rows (source side).
The observations I have from the cluster are:
-The number of jobs triggered as part of the application ID is 235.
-The total number of stages across all the jobs is around 600.
-8 executors are used in a two-node cluster (64 GB RAM in total with 10 cores).
-The YARN Resource Manager UI (for an application ID) becomes very slow to retrieve the details of jobs/stages.
In one of the videos on Spark tuning, I heard that we should try to reduce the number of stages to a bare minimum and keep the DAG small. What are the guidelines for doing this? How do I find the number of shuffles that are happening (my SQL queries have many joins and GROUP BY clauses)?
I would like suggestions on the above scenario: what can I do to improve performance and handle the data skew in SQL queries that are JOIN/GROUP BY heavy?
Thanks
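
On counting shuffles: every Exchange operator in a query's physical plan is a shuffle boundary, so printing the plan for each query gives a rough count. A minimal sketch with placeholder table and column names:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plan-inspection").getOrCreate()

val result = spark.sql(
  """SELECT a.key, count(*) AS cnt
    |FROM table_a a JOIN table_b b ON a.key = b.key
    |GROUP BY a.key""".stripMargin)

// Each "Exchange" node in the printed physical plan is a shuffle; counting them
// across the 50-60 queries gives the total number of shuffles in the pipeline.
result.explain()

// spark.sql.shuffle.partitions controls how many tasks each of those Exchanges produces.
spark.conf.set("spark.sql.shuffle.partitions", "200")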

Spark concurrency performance issue vs. Presto

We are benchmarking Spark with Alluxio and Presto with Alluxio. To evaluate performance we took 5 different queries (with some joins, group by and sort) and ran them on a 650 GB dataset in ORC.
The Spark execution environment is set up so that we have an ever-running Spark context and we submit queries using a REST API (Jetty server). We are not counting the first batch's execution time in this load test, since it takes a little longer because of task deserialization and so on.
What we observed while evaluating is that when we ran individual queries, or even all 5 of these queries concurrently, Spark performed very well compared to Presto and finished all executions in half the time Presto took.
But for the actual load test, we executed 10 batches (one batch is these 5 queries submitted at the same time) with a batch interval of 60 seconds. At this point Presto performed a lot better than Spark: Presto finished all jobs in ~11 minutes while Spark took ~20 minutes to complete all the tasks.
We tried different configurations to improve Spark concurrency, such as:
Using 20 pools with equal resource allocation and submitting jobs in a round-robin fashion (see the sketch after this list).
Using one FAIR pool, submitting all jobs to this default pool and letting Spark decide on resource allocation.
Tuning some Spark properties like spark.locality.wait and some other memory-related properties.
All tasks are NODE_LOCAL (we replicated the data in Alluxio to make this happen).
Also trying different executor memory allocations, e.g. 35 small executors (5 cores, 30 GB) as well as (60-core, 200 GB) executors.
But all of these resulted in the same execution time.
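
For reference, the pool setup described in the first two items above usually looks roughly like the sketch below; the pool names, the allocation-file path and the thread-to-pool mapping are illustrative, not the exact configuration used here:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("concurrent-queries")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // defines the pools
  .getOrCreate()

// Each submitting thread tags its jobs with a pool via a thread-local property,
// e.g. round-robin over pool_0 .. pool_19:
def runInPool(poolIndex: Int, sql: String): Unit = {
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", s"pool_$poolIndex")
  try spark.sql(sql).collect()
  finally spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)
}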
We used dstat on all the workers to see what was happening while Spark was executing tasks, and we saw no or minimal I/O or network activity. CPU was always at 95%+ (it looks like it is CPU-bound). (We saw almost the same dstat output with Presto.)
Can someone suggest something we can try to achieve similar or better results than Presto?
And is there any explanation why Presto handles concurrency better than Spark? We observed that Presto's first batch takes more time than the succeeding batches. Is Presto caching some data in memory that Spark is missing? Or is Presto's resource management/execution plan better than Spark's?
Note: both clusters are running with the same hardware configuration.

Monitor Spark actual work time vs. communication time

On a Spark cluster, if the jobs are very small, I assume the cluster will be used inefficiently, since most of the time will be spent on communication between nodes rather than on utilizing the processors on the nodes.
Is there a way to monitor how much of the time of a job submitted with spark-submit is spent on communication, and how much on actual computation?
I could then monitor this ratio to check how efficient my file aggregation scheme or processing algorithm is in terms of distribution efficiency.
I looked through the Spark docs, and couldn't find anything relevant, though I'm sure I'm missing something. Ideas anyone?
You can see this information in the Spark UI, assuming you are running Spark 1.4.1 or higher (sorry, but I don't know how to do this for earlier versions of Spark).
Here is a sample image:
Here is the page that the image came from.
A brief summary: you can view a timeline of all the events happening in your Spark job within the Spark UI. From there, you can zoom in on each individual job and each individual task. Each task is broken down into scheduler delay, serialization/deserialization, computation, shuffle, etc.
Now, this is obviously a very pretty UI, but you might want something more robust so that you can check this info programmatically. It appears that you can use the REST API to export the logging info in JSON.
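
For the programmatic route, here is a minimal sketch of reading the same per-stage and per-task timings over the monitoring REST API; the host, port, application ID and stage ID are placeholders, and the exact set of endpoints depends on your Spark version:

import scala.io.Source

val base  = "http://driver-host:4040/api/v1"   // placeholder driver host and UI port
val appId = "app-20160101000000-0000"          // placeholder application ID

// Per-stage summaries include executor run time and shuffle read/write sizes:
val stages = Source.fromURL(s"$base/applications/$appId/stages").mkString
println(stages)

// Summary metrics for the tasks of one stage (stage 0, attempt 0 here) break the time
// down into scheduler delay, (de)serialization, shuffle, etc.:
val tasks = Source.fromURL(s"$base/applications/$appId/stages/0/0/taskSummary").mkString
println(tasks)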

spark-cassandra-connector performance: executors seem to be idle

On our 40-node cluster (33 Spark executors / 5 Cassandra nodes),
with Spark Streaming we are inserting about 20,000 rows per minute (among other things) into a Cassandra table (with .saveToCassandra).
The result we get is:
If I understand things correctly, executors S3, S14 and S19 are idle 75% of the time and prevent the stage from finishing... Such a waste of resources! And a performance loss.
Here are my conf options for my SparkContext:
.set("spark.cassandra.output.batch.size.rows", "5120")
.set("spark.cassandra.output.concurrent.writes", "100")
.set("spark.cassandra.output.batch.size.bytes", "100000")
.set("spark.cassandra.connection.keep_alive_ms","60000")
Is this behavior normal? If not, should I tune the above settings to avoid it?
Does the problem come from the spark-cassandra-connector writes, or is it something else?
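
For context, a minimal sketch of the write path being described, assuming a DStream of case-class rows; the keyspace, table, column names and the queueStream input are illustrative stand-ins, not the actual job:

import com.datastax.spark.connector._              // SomeColumns
import com.datastax.spark.connector.streaming._    // saveToCassandra on DStreams
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

case class Event(id: String, value: Double)

val conf = new SparkConf()
  .setAppName("cassandra-writes")
  .set("spark.cassandra.connection.host", "cassandra-host")   // placeholder

val ssc = new StreamingContext(conf, Seconds(60))

// The queueStream stands in for the real receiver in this sketch.
val queue = mutable.Queue(ssc.sparkContext.parallelize(Seq(Event("a", 1.0))))
val events = ssc.queueStream(queue)

// Every micro-batch is written directly to the table:
events.saveToCassandra("my_keyspace", "events", SomeColumns("id", "value"))

ssc.start()
ssc.awaitTermination()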
At first glance I doubt this is a Cassandra connector problem. We are currently doing .saveToCassandra with 300,000 records per minute and smaller clusters.
If it were .saveToCassandra taking a long time, you'd tend to see long tasks. What you're seeing is unexplained(?) gaps between tasks.
It's going to take a good bit more information to track this down. Start on the Jobs tab - do you see any jobs taking a long time? Drill down, what do you see?
