We are benchmarking Spark with Alluxio and Presto with Alluxio. To evaluate performance we took 5 different queries (with some joins, group-bys, and sorts) and ran them against a 650 GB dataset stored in ORC.
The Spark execution environment is set up with a long-running Spark context, and we submit queries through a REST API (a Jetty server). We are not counting the first batch's execution time in this load test, since it takes a little longer due to task deserialization and similar warm-up overhead.
What we observed while evaluating is that when we ran the queries individually, or even all 5 concurrently, Spark performed very well compared to Presto, finishing all the executions in about half the time Presto took.
But for the actual load test, we executed 10 batches (one batch being the 5 queries submitted at the same time) with a batch interval of 60 seconds. Here Presto performed a lot better than Spark: Presto finished all the jobs in ~11 minutes, while Spark took ~20 minutes to complete all the tasks.
We tried different configurations to improve Spark concurrency, such as:
Using 20 pools with equal resource allocation and submitting jobs to them in round-robin fashion (sketched below, after this list).
Using one FAIR pool, submitting all jobs to this default pool, and letting Spark decide on resource allocation.
Tuning Spark properties like spark.locality.wait and some other memory-related settings.
Making all tasks NODE_LOCAL (we replicated the data in Alluxio to achieve this).
Playing around with executor sizing, e.g. 35 small executors (5 cores, 30 GB each) as well as large ones (60 cores, 200 GB).
But all of these resulted in the same execution time.
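For reference, the round-robin pool submission looked roughly like this (a sketch, not our actual code; the pool names, fairscheduler.xml path, and queries list are illustrative):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.spark.sql.SparkSession;

void submitRoundRobin(SparkSession spark, List<String> queries) {
    // spark was built with spark.scheduler.mode=FAIR and
    // spark.scheduler.allocation.file pointing at our fairscheduler.xml
    ExecutorService pool = Executors.newFixedThreadPool(20);
    for (int i = 0; i < queries.size(); i++) {
        final int n = i;
        pool.submit(() -> {
            // setLocalProperty is per-thread, so assign the pool inside the task
            spark.sparkContext().setLocalProperty("spark.scheduler.pool", "pool" + (n % 20));
            spark.sql(queries.get(n)).collect();  // each action runs as its own Spark job
        });
    }
    pool.shutdown();
}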
We ran dstat on all the workers to see what was happening while Spark was executing tasks, and we saw no or minimal I/O or network activity, while CPU was always at 95%+ (it looks like the workload is CPU-bound). We saw almost identical dstat output with Presto.
Can someone suggest something we can try to achieve results similar to or better than Presto's?
And is there any explanation for why Presto handles concurrency better than Spark? We observed that Presto's first batch takes more time than the succeeding batches. Is Presto caching some data in memory that Spark is missing? Or is Presto's resource management/execution planning better than Spark's?
Note: both clusters are running on the same hardware configuration.
Using Spark 2.4.4 in YARN cluster mode with the Spark FIFO scheduler.
I'm submitting multiple Spark DataFrame operations (i.e. writing data to S3) using a thread pool executor with a variable number of threads. This works fine if I have ~10 threads, but if I use hundreds of threads there appears to be a deadlock, with no jobs being scheduled according to the Spark UI.
What factors control how many jobs can be scheduled concurrently? Driver resources (e.g. memory/cores)? Some other Spark configuration settings?
EDIT:
Here's a brief synopsis of my code:
ExecutorService pool = Executors.newFixedThreadPool(nThreads);
ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(pool);

Dataset<Row> aHugeDf = spark.read().json(hundredsOfPaths);

List<Future<Void>> futures = listOfSeveralHundredThings
    .stream()
    .map(aThing -> ecs.submit(() -> {
        // each write is an independent Spark job, filtered to one value
        aHugeDf
            .filter(col("some_column").equalTo(aThing))
            .write()
            .format("org.apache.hudi")
            .options(writeOptions)
            .save(outputPathFor(aThing));
        return null;
    }))
    .collect(Collectors.toList());

// wait for all submissions to finish (or time out)
IntStream.range(0, futures.size()).forEach(i -> {
    try {
        ecs.poll(30, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});
pool.shutdownNow();
At some point, as nThreads increases, Spark no longer seems to schedule any jobs, as evidenced by:
ecs.poll(...) eventually timing out
The Spark UI jobs tab showing no active jobs
The Spark UI executors tab showing no active tasks for any executor
The Spark UI SQL tab showing nThreads running queries with no running job IDs
My execution environment is
AWS EMR 5.28.1
Spark 2.4.4
Master node = m5.4xlarge
Core nodes = 3x r5d.24xlarge
spark.driver.cores=24
spark.driver.memory=32g
spark.executor.memory=21g
spark.scheduler.mode=FIFO
If possible, write the output of the jobs to AWS EMR HDFS (to leverage the almost instantaneous renames and better file I/O of local HDFS) and add a DistCp step to move the files to S3, to save yourself all the trouble of handling the innards of an object store trying to be a filesystem. Writing to local HDFS also allows you to enable speculation to control runaway tasks without falling into the deadlock traps associated with DirectOutputCommitter.
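On EMR, the copy step can be a single s3-dist-cp invocation, e.g. (paths are placeholders):

s3-dist-cp --src hdfs:///output/run1 --dest s3://my-bucket/output/run1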
If you must use S3 as the output directory, ensure that the following Spark configurations are set:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.speculation false
Note: DirectParquetOutputCommitter was removed in Spark 2.0 due to the chance of data loss. Unfortunately, until we have improved consistency from S3A, we have to work with the workarounds. Things are improving with Hadoop 2.8.
Avoid key names in lexicographic order. You can use hashed/random prefixes or reverse date-time to get around this. The trick is to name your keys hierarchically, putting the things you filter by most often on the left side of the key. And never use underscores in bucket names, due to DNS issues.
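For example, two key layouts along those lines (illustrative only):

s3://bucket/f3a9/2019/12/31/part-00000.orc                   # hashed prefix spreads the write load
s3://bucket/sales/region=eu/date=2019-12-31/part-00000.orc   # most-filtered fields on the left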
Enable fs.s3a.fast.upload to upload parts of a single file to Amazon S3 in parallel.
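For example, alongside the settings above:

spark.hadoop.fs.s3a.fast.upload true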
Refer to these articles for more detail:
Setting spark.speculation in Spark 2.1.0 while writing to s3
https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
IMO you're likely approaching this problem wrong. Unless you can guarantee that the number of tasks per job is very low, you're not going to get much performance improvement by parallelizing hundreds of jobs at once. Your cluster can only run 300 tasks at once; assuming you use the default parallelism of 200, that's only 1.5 jobs' worth of tasks. I'd suggest rewriting your code to cap the number of concurrent queries at 10. I strongly suspect that you have 300 queries, each with only a single task of its several hundred actually running. Most OLAP-style data processing systems intentionally keep the level of concurrent queries fairly low compared to more traditional RDBMS systems, for this reason.
also
Apache Hudi has a default parallelism of several hundred FYI.
Why don't you just partition based on your filter column?
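I.e., something along these lines (a sketch; shown with Parquet for simplicity, since Hudi selects its partition column through its own write options such as hoodie.datasource.write.partitionpath.field):

// Sketch: one partitioned write instead of hundreds of per-value filtered jobs.
aHugeDf
    .write()
    .partitionBy("some_column")   // one output directory per distinct value
    .format("parquet")
    .save(outputBasePath);        // outputBasePath is a placeholder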
I would start by eliminating possible causes. Are you sure it's Spark that is unable to submit many jobs? Is it Spark or is it YARN? If it's the latter, you might need to play with the YARN scheduler settings. Could it be something in the ExecutorService implementation that limits the scale you're trying to achieve? Could it be Hudi? With just the snippet, that's hard to determine.
How does the problem manifest itself, other than no jobs starting up? Do you see any metrics/monitoring on the cluster, or any logs, that point to the problem as you describe it?
If it is related to scaling, is it possible for you to autoscale with EMR flex and see if that works for you?
How many executor cores?
Looking into these might help you narrow down or perhaps confirm the issue - unless you have already looked into these things.
(I meant to add this as a comment rather than an answer, but the text was too long for a comment.)
Using threads or thread pools is always problematic and error-prone.
I had a similar problem processing Spark jobs in an Internet of Things application, and I resolved it using fair scheduling.
Suggestions:
Use fair scheduling (fairscheduler.xml) instead of the YARN capacity scheduler.
How? By using dedicated resource pools, one per module; a minimal example follows this list. When used, it will look like the Spark UI screenshot below.
Check in the Spark admin UI that the unit of parallelism (number of partitions) is correct for the DataFrames you use. This is Spark's native way of exploiting parallelism.
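A minimal fairscheduler.xml along those lines might look like this (pool names and weights are illustrative); point Spark at it with spark.scheduler.allocation.file and pick a pool per thread via sc.setLocalProperty("spark.scheduler.pool", "moduleA"):

<?xml version="1.0"?>
<allocations>
  <pool name="moduleA">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="moduleB">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>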
I have a job consisting of around 9 SQL statements that pull data from Hive and write back to a Hive DB. It currently runs for 3 hours, which seems too long considering Spark's ability to process data. The application launches 11 stages in total.
I did some analysis using the Spark UI and found the following grey areas that could be improved:
Stage 8 in Job 5 has a shuffle output of 1.5 TB.
The time gap between Job 4 and Job 5 is 20 minutes. I read about this gap and found that Spark performs I/O outside of any Spark job, which shows up as a gap between two jobs and can be seen in the driver logs.
We have a cluster of 800 nodes with restricted resources per queue, and I am using the configuration below to submit the job:
--num-executors 200
--executor-cores 1
--executor-memory 6G
--deploy-mode client
Attaching Image of UI as well.
Now my questions are:
Where can I find the driver log for this job?
In the image I see a long list of "Executor added" events, which sum to more than 200, but in the Executors tab the number is exactly 200. Any explanation for this?
Out of all the stages, only one has around 35,000 tasks; the rest have only 200 tasks. Should I increase the number of executors, or should I go for Spark's dynamic allocation facility?
Below are some thought processes that may guide you to some extent:
Is it necessary to have one core per executor? You can give each executor more cores; it is a trade-off between slim and fat executors.
Configure the shuffle partition parameter spark.sql.shuffle.partitions (see the sketch after this list).
Ensure that while reading data from Hive you are using a SparkSession with Hive support enabled (the successor of HiveContext). This pulls the data into Spark memory from HDFS and the schema information from the Hive metastore.
Yes, dynamic resource allocation is a feature that helps allocate the right amount of resources; it is better than a fixed allocation.
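For instance, the suggestions above might translate into something like this (a sketch; the application name, table, and values are illustrative, not recommendations):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("hive-etl")                                // illustrative name
    .enableHiveSupport()                                // read data and schema via the Hive metastore
    .config("spark.sql.shuffle.partitions", "1000")     // size to your shuffle volume (default is 200)
    .config("spark.dynamicAllocation.enabled", "true")  // let YARN grow/shrink the executor set
    .config("spark.shuffle.service.enabled", "true")    // required for dynamic allocation on YARN
    .getOrCreate();

Dataset<Row> src = spark.sql("SELECT * FROM db.source_table");  // placeholder table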
I have a processing pipeline built using Spark SQL. The objective is to read data from Hive in the first step and apply a series of functional operations (using Spark SQL) to produce the functional output. These operations are quite numerous (more than 100), which means I am running around 50 to 60 Spark SQL queries in a single pipeline. While the application completes successfully without any issues, my focus has shifted to optimizing the overall process. I have been able to speed up execution by tuning spark.sql.shuffle.partitions, changing the executor memory, and reducing spark.memory.fraction from the default 0.6 to 0.2. With these changes the overall execution time dropped from 20-25 minutes to around 10 minutes. Data volume is around 100k rows (source side).
The observations I have from the cluster are:
- The number of jobs triggered as part of the application ID is 235.
- The total number of stages across all the jobs is around 600.
- 8 executors are used in a two-node cluster (64 GB RAM in total, with 10 cores).
- The YARN ResourceManager UI (for an application ID) becomes very slow at retrieving the details of jobs/stages.
In a video on Spark tuning, I heard that we should try to reduce the number of stages to a bare minimum and keep the DAG small. What are the guidelines for doing this? How do I find out how many shuffles are happening (my SQLs have many joins and group-by clauses)?
I would like suggestions on the above scenario: what can I do to improve performance and handle the data skew in SQL queries that are JOIN/GROUP BY heavy?
Thanks
We recently set up the Spark Job Server, to which our Spark jobs are submitted, but we found that our 20-node (8 cores/128 GB memory per node) Spark cluster can only sustain 10 Spark jobs running concurrently.
Can someone share some detail on which factors actually affect how many Spark jobs can run concurrently? How can we tune the configuration to take full advantage of the cluster?
The question is missing some context, but first: it seems that Spark Job Server limits the number of concurrent jobs (unlike Spark itself, which limits the number of tasks, not jobs):
From application.conf
# Number of jobs that can be run simultaneously per context
# If not set, defaults to number of cores on machine where jobserver is running
max-jobs-per-context = 8
If that's not the issue (you set the limit higher, or are using more than one context), then the total number of cores in the cluster (8 * 20 = 160) is the maximum number of concurrent tasks. If each of your jobs creates 16 tasks, then 10 jobs (160 / 16) saturate the cluster, and Spark will queue the next incoming job until CPUs become available; that matches the 10 concurrent jobs you observe.
Spark creates a task per partition of the input data, and the number of partitions is determined by the partitioning of the input on disk, or by calling repartition or coalesce on the RDD/DataFrame to change the partitioning manually. Some actions that operate on more than one RDD (e.g. union) may also change the number of partitions.
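For example, a sketch (the input path and target count are placeholders; spark is an existing SparkSession):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Aim the task count at the cluster's capacity: 20 nodes x 8 cores = 160 slots
Dataset<Row> df = spark.read().parquet("hdfs:///data/input");
Dataset<Row> balanced = df.repartition(160);  // one task per core for the next stage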
Some things that could limit the parallelism that you're seeing:
If your job consists only of map operations (or other shuffle-less operations), it will be limited to the number of partitions of the data you have. So even if you have 20 executors, if you have 10 partitions of data it will only spawn 10 tasks (unless the data is in a splittable format such as Parquet or LZO-indexed text).
If you're performing a take() operation (without a shuffle), it performs an exponential take, using only one task and then growing until it collects enough data to satisfy the take operation. (Another question similar to this)
Can you share more about your workflow? That would help us diagnose it.
I am new to Spark. I am facing a performance issue when the number of worker nodes is increased. To investigate, I tried some sample code in spark-shell.
I created an Amazon AWS EMR cluster with 2 worker nodes (m3.xlarge) and used the following code in spark-shell on the master node:
val df = sqlContext.range(0, 6000000000L).withColumn("col1", rand(10)).withColumn("col2", rand(20))
df.selectExpr("id", "col1", "col2", "if(id%2=0,1,0) as key").groupBy("key").agg(avg("col1"), avg("col2")).show()
This code executed without any issues and took around 8 minutes. But when I added 2 more worker nodes (m3.xlarge) and executed the same code from spark-shell on the master node, the time increased to 10 minutes.
Here is the issue: I think the time should decrease (not necessarily by half, but it should decrease). I have no idea why the same Spark job takes more time as worker nodes are added. Any idea why this is happening? Am I missing anything?
This should not happen, but it is possible for an algorithm to run slower when distributed.
Basically, if the synchronization part is heavy, doing it with 2 nodes will take more time than with one.
I would start by comparing some simpler transformations, running more asynchronous code without any synchronization points (such as a group by key), and seeing if you get the same issue.
@z-star, yes, an algorithm might be slow when distributed. I found the solution by using Spark dynamic allocation, which enables Spark to use only the required executors, whereas static allocation runs a job on all executors; that was what increased the execution time with more nodes.