AWS Glue not using all executors - apache-spark

I am using some AWS glue to perform some ETL operations. My program writes a computed dataframe to S3. When I look at the metrics, i find that not all my executors are being used, infact just one is being used.
How do I make sure all my allocated executors are being busy ?
Thanks.
I do not use gluecontext in my program just native sparkcontext

Not using gluecontext could be one of the reason for one executor being used.
https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-straggler.html
Especially read Memory Profile section:
After the first two stages, only executor number 3 is actively
consuming memory to process the data. The remaining executors are
simply idle or have been relinquished shortly after the completion of
the first two stages.

I found that my job was not using all the executors, despite having a lot of data to process. The problem was in the set up on my SparkContext. I was using SparkContext.setMaster("local[*]"), which I believe makes the job run on only one executor (driver). If that helps your problem or anyone else facing the same issue.

Related

Deadlock when many spark jobs are concurrently scheduled

Using spark 2.4.4 running in YARN cluster mode with the spark FIFO scheduler.
I'm submitting multiple spark dataframe operations (i.e. writing data to S3) using a thread pool executor with a variable number of threads. This works fine if I have ~10 threads, but if I use hundreds of threads, there appears to be a deadlock, with no jobs being scheduled according to the Spark UI.
What factors control how many jobs can be scheduled concurrently? Driver resources (e.g. memory/cores)? Some other spark configuration settings?
EDIT:
Here's a brief synopsis of my code
ExecutorService pool = Executors.newFixedThreadPool(nThreads);
ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(pool);
Dataset<Row> aHugeDf = spark.read.json(hundredsOfPaths);
List<Future<Void>> futures = listOfSeveralHundredThings
.stream()
.map(aThing -> ecs.submit(() -> {
df
.filter(col("some_column").equalTo(aThing))
.write()
.format("org.apache.hudi")
.options(writeOptions)
.save(outputPathFor(aThing));
return null;
}))
.collect(Collectors.toList());
IntStream.range(0, futures.size()).forEach(i -> ecs.poll(30, TimeUnit.MINUTES));
exec.shutdownNow();
At some point, as nThreads increases, spark no longer seems to be scheduling any jobs as evidenced by:
ecs.poll(...) timing out eventually
The Spark UI jobs tab showing no active jobs
The Spark UI executors tab showing no active tasks for any executor
The Spark UI SQL tab showing nThreads running queries with no running job ID's
My execution environment is
AWS EMR 5.28.1
Spark 2.4.4
Master node = m5.4xlarge
Core nodes = 3x rd5.24xlarge
spark.driver.cores=24
spark.driver.memory=32g
spark.executor.memory=21g
spark.scheduler.mode=FIFO
If possible write the output of the jobs to AWS Elastic MapReduce hdfs (to leverage on the almost instantaneous renames and better file IO of local hdfs) and add a dstcp step to move the files to S3, to save yourself all the troubles of handling the innards of an object store trying to be a filesystem. Also writing to local hdfs will allow you to enable speculation to control runaway tasks without falling into the deadlock traps associated with DirectOutputCommiter.
If you must use S3 as the output directory ensure that the following Spark configurations are set
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.speculation false
Note: DirectParquetOutputCommitter is removed from Spark 2.0 due to the chance of data loss. Unfortunately until we have improved consistency from S3a we have to work with the workarounds. Things are improving with Hadoop 2.8
Avoid keynames in lexicographic order. One could use hashing/random prefixes or reverse date-time to get around.The trick is to name your keys hierarchically, putting the most common things you filter by on the left side of your key. And never have underscores in bucket names due to DNS issues.
Enabling fs.s3a.fast.upload upload parts of a single file to Amazon S3 in parallel
Refer these articles for more detail-
Setting spark.speculation in Spark 2.1.0 while writing to s3
https://medium.com/#subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
IMO you're likely approaching this problem wrong. Unless you can guarantee that the number of tasks per job is very low, you're likely not going to get much performance improvement by parallelizing 100s of jobs at once. Your cluster can only support 300 tasks at once, assuming you're using the default parallelism of 200 thats only 1.5 jobs. I'd suggest rewriting your code to cap max concurrent queries at 10. I highly suspect that you have 300 queries with only a single task of several hundred actually running. Most OLTP data processing system intentionally have a fairly low level of concurrent queries compared to more traditional RDS systems for this reason.
also
Apache Hudi has a default parallelism of several hundred FYI.
Why don't you just partition based on your filter column?
I would start by eliminating possible causes. Are you sure its spark that is not able to submit many jobs? Is it spark or is it YARN? If it is the later, you might need to play with the YARN scheduler settings. Could it be something to do with ExecutorService implementation that may have some limitation for the scale you are trying to achieve? Could it be hudi? With the snippet thats hard to determine.
How does the problem manifest itself other than no jobs starting up? Do you see any metrics / monitoring on the cluster or any logs that point to the problem as you say it?
If it is to do with scaling, is is possible for you to autoscale with EMR flex and see if that works for you?
How many executor cores?
Looking into these might help you narrow down or perhaps confirm the issue - unless you have already looked into these things.
(I meant to add this as comment rather than answer but text too long for comment)
Using threads or thread pools are always problematic and error prone.
I had similar problem in processing spark jobs in one of Internet of things application. I resolved using fair scheduling.
Suggestions :
Use fair scheduling (fairscheduler.xml) instead of yarn capacity scheduler
how to ? see this by using dedicated resource pools one per module. when used it will look like below spark ui
See that unit of parllelism (number of partitions ) are correct for data frames you use by seeing spark admin ui. This is spark native way of using parllelism.

Number of Executor Cores and benefits or otherwise - Spark

Some run-time clarifications are requested.
In a thread elsewhere I read, it was stated that a Spark Executor should only have a single Core allocated. However, I wonder if this is really always true. Reading the various SO-questions and the likes of, as well as Karau, Wendell et al, it is clear that there are equal and opposite experts who state one should in some cases specify more Cores per Executor, but the discussion tends to be more technical than functional. That is to say, functional examples are lacking.
My understanding is that a Partition of an RDD or DF, DS, is serviced by a single Executor. Fine, no issue, makes perfect sense. So, how can the Partition benefit from multiple Cores?
If I have a map followed by, say a, filter, these are not two Tasks that can be interleaved - as in what Informatica does, as my understanding is they are fused together. This being so, then what is an example of benefit from an assigned Executor running more Cores?
From JL: In other (more technical) words, a Task is a computation on the records in a RDD partition in a Stage of a RDD in a Spark Job. What does it mean functionally speaking, in practice?
Moreover, can Executor be allocated if not all Cores can be acquired? I presume there is a wait period and that after a while it may be allocated in a more limited capacity. True?
From a highly rated answer on SO, What is a task in Spark? How does the Spark worker execute the jar file?, the following is stated: When you create the SparkContext, each worker starts an executor. From another SO question: When a SparkContext is created, each worker node starts an executor.
Not sure I follow these assertions. If Spark does not know the number of partitions etc. in advance, why allocate Executors so early?
I ask this, as even this excellent post How are stages split into tasks in Spark? does not give a practical example of multiple Cores per Executor. I can follow the post clearly and it fits in with my understanding of 1 Core per Executor.
My understanding is that a Partition (...) serviced by a single Executor.
That's correct, however the opposite is not true - a single executor can handle multiple partitions / tasks across multiple stages or even multiple RDDs).
then what is an example of benefit from an assigned Executor running more Cores?
First and foremost processing multiple tasks at the same time. Since each executor is a separate JVM, which is a relatively heavy process, it might preferable to keep only instance for a number of threads. Additionally it can provide further advantages, like exposing shared memory that can be used across multiple tasks (for example to store broadcast variables).
Secondary application is applying multiple threads to a single partition when user invokes multi-threaded code. That's however not something that is done by default (Number of CPUs per Task in Spark)
See also What are the benefits of running multiple Spark tasks in the same JVM?
If Spark does not know the number of partitions etc. in advance, why allocate Executors so early?
Pretty much by extension of the points made above - executors are not created to handle specific task / partition. There are long running processes, and as long as dynamic allocation is not enabled, there are intended to last for the full lifetime of the corresponding application / driver (preemption or failures, as well as already mentioned dynamic allocation, can affect that, but that's the basic model).

Spark Executor OOM issue

I have a typical batch job that reads CSV from cloud storage then do a bunch of join and aggregate, the whole file does not exceed 3G. But I keep getting OOM exception when writing the result back to storage, I have two executor, each has 80G of RAM, it just doesn't make sense, here is the screen shot of my spark UI and exception. And suggestion is appreciated, if my code is super sub-optimal in terms of memory, why it doesn't show up on the spark UI?
update: the source code is too convoluted to show here, but I figured out the essential cause is multiple join.
Dataset<Row> ret = something dataframe
for (String cmd : cmds) {
ret = ret.join(processDataset(ret, cmd), "primary_key")
}
so, each processDataset(ret, cmd), if you run it on its own, it's very fast, but if you have this kinda of for loop join for a lot of times, say 10 or 20 times, it gets much much much slower, and have this OOM issues.
When I have problems with memory I check these things:
Have more executors (more than 2, defined by total-executor-cores in spark-submit and spark.executor.core in SparkSession)
Have less cores per executor (3-5). You have 14 which much more than recommended (spark.executor.core)
Add memory to executors (spark.executor.memory)
Add memory to driver (driver-memory in spark-submit script)
Make more partitions (make partitions smaller in size) (.config("spark.sql.shuffle.partitions", numPartitionsShuffle) in SparkSession)
Look at PeakExecutionMemory of a Tasks in Stages (one of the additional metrics to turn on) tab to see if it is not to big
If you use Mesos in Agents tab you can see the real usage of memory per driver and executors (see this answer How to get Mesos Agents Framework Executor Memory
Look at explain in your code to analyze the execution plan
See if one of your joins does not explode your memory by making multiple duplicates of lines

How does mllib code run on spark?

I am new to distributed computing, and I'm trying to run Kmeans on EC2 using Spark's mllib kmeans. As I was reading through the tutorial I found the following code snippet on
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
I am having trouble understanding how this code runs inside the cluster. Specifically, I'm having trouble understanding the following:
After submitting the code to master node, how does spark know how to parallelize the job? Because there seem to be no part of the code that deals with this.
Is the code copied to all nodes and executed on each node? Does the master node do computation?
How do node communitate the partial result of each iteration? Is this dealt inside the kmeans.train code, or is the spark core takes care of it automatically?
Spark divides data to many partitions. For example, if you read a file from HDFS, then partitions should be equal to partitioning of data in HDFS. You can manually specify number of partitions by doing repartition(numberOfPartitions). Each partition can be processed on separate node, thread, etc. Sometimes data are partitioned by i.e. HashPartitioner, which looks on hash of the data.
Number of partitions and size of partitions generally tells you if data is distributed/parallelized correctly. Creating partitions of data is hidden in RDD.getPartitions methods.
Resource scheduling depends on cluster manager. We can post very long post about them ;) I think that in this question, the partitioning is the most important. If not, please inform me, I will edit answer.
Spark serializes clusures, that are given as arguments to transformations and actions. Spark creates DAG, which is sent to all executors and executors execute this DAG on the data - it launches closures on each partition.
Currently after each iteration, data is returned to the driver and then next job is scheduled. In Drizzle project, AMPLab/RISELab is creating possibility to create multiple jobs on one time, so data won't be sent to the driver. It will create DAG one time and schedules i.e. job with 10 iterations. Shuffle between them will be limited / will not exists at all. Currently DAG is created in each iteration and job in scheduled to executors
There is very helpful presentation about resource scheduling in Spark and Spark Drizzle.

Spark Performance issue while adding more worker nodes

I am being new on Spark. I am facing performance issue when the number of worker nodes are increased. So to investigate that, I have tried some sample code on spark-shell.
I have created a Amazon AWS EMR with 2 worker nodes (m3.xlarge). I have used the following code on spark-shell on the master node.
var df = sqlContext.range(0,6000000000L).withColumn("col1",rand(10)).withColumn("col2",rand(20))
df.selectExpr("id","col1","col2","if(id%2=0,1,0) as key").groupBy("key").agg(avg("col1"),avg("col2")).show()
This code executed without any issues and took around 8 mins. But when I have added 2 more worker nodes (m3.xlarge) and executed the same code using spark-shell on master node, the time increased to 10 mins.
Here is the issue, I think the time should be decreased, not by half, but I should decrease. I have no idea why on increasing worker node same spark job is taking more time. Any idea why this is happening? Am I missing any thing?
This should not happen, but it is possible for an algorithm to run slower when distributed.
Basically, if the synchronization part is a heavy one, doing that with 2 nodes will take more time then with one.
I would start by comparing some simpler transformations, running a more asynchronous code, as without any sync points (such as group by key), and see if you get the same issue.
#z-star, yes an algorithm might b slow when distributed. I found the solution by using Spark Dynamic Allocation. This enable spark to use only required executors. While the static allocation runs a job on all executors, which was increasing the execution time with more nodes.

Resources