Utilize multiple executors and workers in Spark job - apache-spark

I am running Spark in standalone mode with the spark-env configuration below:
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=4g
With this I can see 4 workers in my Spark UI on port 8080.
However, the number of executors shown on my master URL (port 4040) is just one. How can I increase this to, say, 2 per worker node?
Also, when I run a small piece of code in Spark it only makes use of one executor. Do I need to make any config change to ensure multiple executors on multiple workers are used?
Any help is appreciated.

Set the spark.master parameter to local[k], where k is the number of threads you want to utilize. It is better to pass these parameters in the spark-submit command instead of using export.

Parallel processing is based on the number of partitions of the RDD. If your RDD has multiple partitions, it will be processed in parallel.
Do some modification (repartition) in your code and it should work; see the sketch below.
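A minimal sketch of the repartitioning idea, assuming a JavaSparkContext named sc and a made-up HDFS input path:

// read some input (path is hypothetical)
JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
// spread the data over 8 partitions so up to 8 tasks (and hence several executors) can work in parallel
JavaRDD<String> repartitioned = lines.repartition(8);
System.out.println("records: " + repartitioned.count());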

Related

Deadlock when many spark jobs are concurrently scheduled

Using Spark 2.4.4 running in YARN cluster mode with the Spark FIFO scheduler.
I'm submitting multiple Spark dataframe operations (i.e. writing data to S3) using a thread pool executor with a variable number of threads. This works fine if I have ~10 threads, but if I use hundreds of threads, there appears to be a deadlock, with no jobs being scheduled according to the Spark UI.
What factors control how many jobs can be scheduled concurrently? Driver resources (e.g. memory/cores)? Some other spark configuration settings?
EDIT:
Here's a brief synopsis of my code
ExecutorService pool = Executors.newFixedThreadPool(nThreads);
ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(pool);

// read one big dataframe up front; each thread then writes one filtered slice of it
Dataset<Row> aHugeDf = spark.read().json(hundredsOfPaths);

List<Future<Void>> futures = listOfSeveralHundredThings
    .stream()
    .map(aThing -> ecs.submit(() -> {
        aHugeDf
            .filter(col("some_column").equalTo(aThing))
            .write()
            .format("org.apache.hudi")
            .options(writeOptions)
            .save(outputPathFor(aThing));
        return null;
    }))
    .collect(Collectors.toList());

// wait for each submitted write to complete (or time out after 30 minutes)
IntStream.range(0, futures.size()).forEach(i -> {
    try {
        ecs.poll(30, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});
pool.shutdownNow();
At some point, as nThreads increases, Spark no longer seems to be scheduling any jobs, as evidenced by:
ecs.poll(...) timing out eventually
The Spark UI jobs tab showing no active jobs
The Spark UI executors tab showing no active tasks for any executor
The Spark UI SQL tab showing nThreads running queries with no running job IDs
My execution environment is
AWS EMR 5.28.1
Spark 2.4.4
Master node = m5.4xlarge
Core nodes = 3x rd5.24xlarge
spark.driver.cores=24
spark.driver.memory=32g
spark.executor.memory=21g
spark.scheduler.mode=FIFO
If possible, write the output of the jobs to the AWS Elastic MapReduce HDFS (to leverage the almost instantaneous renames and better file IO of local HDFS) and add a distcp step to move the files to S3, saving yourself all the trouble of handling the innards of an object store trying to be a filesystem. Also, writing to local HDFS will allow you to enable speculation to control runaway tasks without falling into the deadlock traps associated with DirectOutputCommitter.
If you must use S3 as the output directory, ensure that the following Spark configurations are set:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.speculation false
Note: DirectParquetOutputCommitter was removed in Spark 2.0 due to the chance of data loss. Unfortunately, until we have improved consistency from S3A we have to work with these workarounds. Things are improving with Hadoop 2.8.
Avoid key names in lexicographic order. One could use hashing/random prefixes or reverse date-time to get around this. The trick is to name your keys hierarchically, putting the most common things you filter by on the left side of your key. And never have underscores in bucket names, due to DNS issues.
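A toy illustration (all names here are made up) of the hashing-prefix idea, so that keys written at the same time do not all sort into the same lexicographic range:

// hypothetical helper: a short hash prefix spreads keys across S3 partitions,
// while the hierarchical table/date/file layout keeps filtering convenient
static String s3Key(String table, String date, String file) {
    String hashPrefix = Integer.toHexString((table + "/" + date).hashCode() & 0xff);
    return hashPrefix + "/" + table + "/" + date + "/" + file;
}
// yields keys like "<2 hex chars>/events/2019-12-01/part-0001.parquet"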
Enabling fs.s3a.fast.upload uploads parts of a single file to Amazon S3 in parallel.
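A minimal sketch, assuming a SparkSession is being built in code, of how the settings above (the v2 committer algorithm, disabled speculation, and S3A fast upload) might be applied; the app name is made up:

SparkSession spark = SparkSession.builder()
    .appName("s3-output-job")
    // v2 file output committer avoids slow, rename-heavy commits on S3
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    // speculation can produce duplicate attempts writing to the same S3 path
    .config("spark.speculation", "false")
    // upload parts of a single file to S3 in parallel
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    .getOrCreate();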
Refer to these articles for more detail:
Setting spark.speculation in Spark 2.1.0 while writing to s3
https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
IMO you're likely approaching this problem wrong. Unless you can guarantee that the number of tasks per job is very low, you're likely not going to get much performance improvement by parallelizing hundreds of jobs at once. Your cluster can only support 300 tasks at once; assuming you're using the default parallelism of 200, that's only 1.5 jobs. I'd suggest rewriting your code to cap the number of concurrent queries at 10 (see the sketch below). I highly suspect that you have 300 queries with only a single task out of several hundred actually running. Most OLTP data processing systems intentionally have a fairly low level of concurrent queries compared to more traditional RDS systems for this reason.
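A minimal sketch of the capping idea, reusing the hypothetical names from the synopsis above (aHugeDf, listOfSeveralHundredThings, writeOptions, outputPathFor); the only change is that the pool size is fixed at 10, so at most 10 write jobs are in flight at a time:

// cap concurrency: at most 10 write jobs run at the same time
ExecutorService pool = Executors.newFixedThreadPool(10);
ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(pool);
listOfSeveralHundredThings.forEach(aThing -> ecs.submit(() -> {
    aHugeDf.filter(col("some_column").equalTo(aThing))
           .write().format("org.apache.hudi").options(writeOptions)
           .save(outputPathFor(aThing));
    return null;
}));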
also
Apache Hudi has a default parallelism of several hundred FYI.
Why don't you just partition based on your filter column?
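For example (a sketch, not the poster's code, and shown with a plain Parquet write since Hudi has its own partition-path options), the hundreds of filtered writes could collapse into a single partitioned write:

aHugeDf
    .write()
    // one output directory per value of some_column, written by a single Spark job
    .partitionBy("some_column")
    .format("parquet")
    .save("s3://my-bucket/output/");   // hypothetical output path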
I would start by eliminating possible causes. Are you sure it's Spark that is not able to submit many jobs? Is it Spark or is it YARN? If it is the latter, you might need to play with the YARN scheduler settings. Could it be something to do with the ExecutorService implementation that may have some limitation at the scale you are trying to achieve? Could it be Hudi? With the snippet it's hard to determine.
How does the problem manifest itself other than no jobs starting up? Do you see any metrics / monitoring on the cluster or any logs that point to the problem you describe?
If it is to do with scaling, is it possible for you to autoscale with EMR flex and see if that works for you?
How many executor cores?
Looking into these might help you narrow down or perhaps confirm the issue - unless you have already looked into these things.
(I meant to add this as a comment rather than an answer, but the text is too long for a comment.)
Using threads or thread pools is always problematic and error-prone.
I had a similar problem processing Spark jobs in an Internet of Things application. I resolved it using fair scheduling.
Suggestions:
Use fair scheduling (fairscheduler.xml) instead of the YARN capacity scheduler.
How? Use dedicated resource pools, one per module; when used, each module's jobs appear under their own pool in the Spark UI. A configuration sketch follows after these suggestions.
Check that the unit of parallelism (number of partitions) is correct for the data frames you use by looking at the Spark admin UI. This is Spark's native way of using parallelism.
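A minimal sketch of that setup, assuming a fairscheduler.xml that defines a pool per module (the pool name "moduleA" and the file path here are made up):

SparkSession spark = SparkSession.builder()
    .appName("fair-scheduling-example")   // hypothetical app name
    .config("spark.scheduler.mode", "FAIR")
    // pool definitions (one per module) live in this file
    .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    .getOrCreate();

// inside each worker thread, route that thread's jobs to its dedicated pool
spark.sparkContext().setLocalProperty("spark.scheduler.pool", "moduleA");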

AWS Glue not using all executors

I am using AWS Glue to perform some ETL operations. My program writes a computed dataframe to S3. When I look at the metrics, I find that not all my executors are being used; in fact, just one is being used.
How do I make sure all my allocated executors are kept busy?
Thanks.
I do not use GlueContext in my program, just the native SparkContext.
Not using GlueContext could be one of the reasons only one executor is being used.
https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-straggler.html
Especially read Memory Profile section:
After the first two stages, only executor number 3 is actively consuming memory to process the data. The remaining executors are simply idle or have been relinquished shortly after the completion of the first two stages.
I found that my job was not using all the executors, despite having a lot of data to process. The problem was in the setup of my SparkContext: I was using SparkConf.setMaster("local[*]"), which I believe makes the job run on only one executor (the driver). Hopefully that helps your problem or anyone else facing the same issue; see the sketch below.
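A small sketch of that fix (the app name is made up): drop the hard-coded local master so the cluster / job submission environment supplies it instead:

// problematic: forces everything onto a single local JVM (the driver)
SparkConf bad = new SparkConf().setAppName("etl-job").setMaster("local[*]");

// better: leave the master unset in code and let the environment / job submission provide it
SparkConf good = new SparkConf().setAppName("etl-job");
JavaSparkContext sc = new JavaSparkContext(good);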

How does Apache Spark assign partition-ids to its executors

I have a long-running Spark streaming job which uses 16 executors with only one core each.
I use the default partitioner (HashPartitioner) to distribute data equally across 16 partitions. Inside the updateStateByKey function, I checked the partition id from TaskContext.getPartitionId() over multiple batches and found that the partition-id of an executor is quite consistent, but still changes to another id after a long run.
I'm planning to do some optimization on Spark's "updateStateByKey" API, but it can't be achieved if the partition-id keeps changing between batches.
So when does Spark change the partition-id of an executor?
Most probably, the task failed and was restarted, so the TaskContext has changed, and so has the partitionId.
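A small sketch (assuming it runs inside a task, e.g. within the updateStateByKey closure) of how the attempt number could be logged next to the partition id to confirm that a task retry is behind the change:

// inside a task: attemptNumber > 0 means this is a retry of a failed task
TaskContext tc = TaskContext.get();
System.out.println("partition=" + tc.partitionId()
    + " attempt=" + tc.attemptNumber()
    + " taskAttemptId=" + tc.taskAttemptId());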

Spark Yarn running 1000 jobs in queue

I am trying to schedule 1000 jobs in a YARN cluster. I want to run more than 1000 jobs daily at the same time, with YARN managing the resources. For 1000 files of different categories in HDFS, I am building spark-submit commands from Python and executing them. But I am getting an out-of-memory error because spark-submit uses driver memory.
How can I schedule 1000 jobs in a Spark YARN cluster? I even tried the Oozie job scheduling framework along with Spark; it did not work as expected with HDP.
Actually, you might not need 1000 jobs to read from 1000 files in HDFS. You could try to load everything into a single RDD as well, since the APIs do support reading multiple files and wildcards in paths (see the sketch below). After reading all the files into a single RDD, you should really focus on ensuring you have enough memory, cores, etc. assigned to it, and start looking at your business logic to avoid costly operations like shuffles.
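A minimal sketch of the single-RDD / single-DataFrame idea, with made-up HDFS paths where the category is a directory level:

// read all 1000 files at once using a wildcard (paths are hypothetical)
JavaRDD<String> all = sc.textFile("hdfs:///data/categories/*/part-*");
// or, for structured data, read everything into one DataFrame
Dataset<Row> df = spark.read().json("hdfs:///data/categories/*/");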
But, if you insist that you need to spawn 1000 jobs, one for each file, you should look at --executor-memory and --executor-cores (along with --num-executors for parallelism). These give you leverage to optimise for memory/CPU footprint.
Also, curiously, you say that you get OOM during spark-submit (using driver memory). The driver doesn't really use much memory at all, unless you do things like collect or take with a large dataset, which bring the data from the executors to the driver. Are you firing the jobs in yarn-client mode? Another hunch is to check whether the box where you spawn the Spark jobs even has enough memory just to spawn them in the first place.
It would be easier if you could also paste some logs here.

Spark cluster only using master

I have a Spark 1.1.0 cluster with three machines of differing power. When I run the start-all.sh script and check the UI, I have all slaves and the master listed. Each worker is listed (they have differing numbers of cores) with the number of cores shown correctly, but with a notice that zero are used.
cores
4 (0 Used)
2 (0 Used)
8 (8 Used)
SSH is set up and working, and Hadoop seems fine too. The 8-core machine is the master, so any submitted job runs only there. I see it being executed in the web UI, but the other workers are never given work.
What might be happening here is that the Total_Input_File_Size could be less than MAX_SPLIT_SIZE. So there will be only one mapper running, which will execute only on the master.
The number of mappers generated is Total_Input_File_Size / MAX_SPLIT_SIZE. So, if you have given a small file, try a larger input file or lower the value of max_split_size.
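For example (just to illustrate the rough arithmetic above): with a 64 MB input file and a 128 MB max split size, 64/128 rounds up to a single split, so only one task runs; with a 1 GB file you would get roughly 1024/128 = 8 splits, so up to 8 tasks could run in parallel.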
Let me know if the problem is anything else.
Have you set --deploy-mode cluster in your spark-submit command?
If you leave this option out, the application will not travel to the other workers.
