Is there any way to run Spark scripts and store outputs in parallel with Oozie?

I have 3 Spark scripts, and each one runs a single Spark SQL query that reads a partitioned table and stores the result to an HDFS location. Each script has a different SQL statement and a different output folder.
test1.py - Read from table 1 and store to location 1.
test2.py - Read from table 2 and store to location 2.
test3.py - Read from table 3 and store to location 3.
I run these scripts using a fork action in Oozie, and all three run. But the problem is that the scripts are not storing data in parallel.
Once the store from one script is done, the next store starts.
My expectation is that the data from all 3 tables is stored into the respective locations in parallel.
I have tried FAIR scheduling and other scheduler techniques in the Spark scripts, but those don't work. Can anyone please help? I have been stuck on this for the last 2 days.
I am using AWS EMR 5.15, Spark 2.4 and Oozie 5.0.0.

For Capacity scheduler
If you are submitting jobs to a single queue, whichever job arrives first in the queue gets the resources. Intra-queue preemption won't work.
There is a related Jira for intra-queue preemption in the Capacity scheduler: https://issues.apache.org/jira/browse/YARN-10073
You can read more at https://blog.cloudera.com/yarn-capacity-scheduler/
For Fair scheduler
Setting the "yarn.scheduler.fair.preemption" parameter to "true" in yarn-site.xml enables preemption at the cluster level. By default this is false, i.e. no preemption.
Your problem could be:
One job is taking the maximum resources. To verify this, check the YARN UI and the Spark UI.
Or, if you have more than one YARN queue (other than default), try setting User Limit Factor > 1 for the queue you are using.
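If the YARN UI does show one job holding everything, one possible follow-up is to cap each forked Spark application so that all three fit at once. Below is a rough sketch of that arithmetic in Python. The cluster numbers are hypothetical (they are not from the question), the calculation ignores memory and the driver/application master containers, and it assumes the cap would be applied through spark.dynamicAllocation.maxExecutors.

```python
def executors_per_job(nodes, cores_per_node, cores_per_executor, concurrent_jobs):
    """Largest executor count each job can hold while all jobs still fit."""
    total_executors = nodes * (cores_per_node // cores_per_executor)
    return total_executors // concurrent_jobs

# Hypothetical cluster: 4 nodes x 16 vcores, 4 cores per executor, 3 forked jobs.
cap = executors_per_job(nodes=4, cores_per_node=16,
                        cores_per_executor=4, concurrent_jobs=3)

# Flag that each of the three spark-submit actions could pass (an assumption,
# not something the original answer prescribes).
print(f"--conf spark.dynamicAllocation.maxExecutors={cap}")
```

With these illustrative numbers each job is limited to 5 executors, which leaves room for the other two forked actions to start.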

Related

Number of Tasks in Spark UI

I am new to Spark. I have a couple of questions regarding the Spark Web UI:
I have seen that Spark can create multiple Jobs for the same application. On what basis does it create the Jobs?
I understand Spark creates multiple Stages for a single Job around Shuffle boundaries. I also understand that there is 1 task per partition. However, I have seen a particular Stage (e.g. Stage 1) of a particular Job create fewer tasks than the default shuffle partitions value (e.g. only 2/2 completed). And I have also seen the next Stage (Stage 2) of the same Job create 1500 tasks (e.g. 1500/1500 completed), which is more than the default shuffle partitions value.
So, how does Spark determine how many tasks it should create for any particular Stage?
Can anyone please help me understand the above?
The maximum number of tasks running at any one moment depends on your number of executors and the cores per executor.
Different stages can have different numbers of tasks.
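To make the answer concrete, here is a small sketch of the arithmetic, assuming (as the question states) one task per partition. The executor and core counts are hypothetical, and cpus_per_task stands for the spark.task.cpus setting, which defaults to 1.

```python
import math

def task_waves(num_partitions, num_executors, cores_per_executor, cpus_per_task=1):
    """Return (slots, waves): concurrent task slots and rounds needed for one stage."""
    slots = (num_executors * cores_per_executor) // cpus_per_task
    waves = math.ceil(num_partitions / slots)
    return slots, waves

# Hypothetical cluster of 10 executors x 4 cores.
print(task_waves(2, 10, 4))     # stage with 2 partitions    -> (40, 1)
print(task_waves(1500, 10, 4))  # stage with 1500 partitions -> (40, 38)
```

The task count of a stage follows the partition count of the data that stage reads, so a stage can have 2 tasks or 1500 tasks regardless of the shuffle partitions setting. The cluster size only decides how many of those tasks run at the same time.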

Spark write to HDFS is slow

I have ORC data on HDFS (non-partitioned), ~8 billion rows, 250 GB in size.
I am reading the data into a DataFrame and writing it without any transformations, using partitionBy
ex:
df.write.mode("overwrite").partitionBy("some_column").orc("hdfs path")
When I monitored the job status in the Spark UI, the job and stage completed in 20 minutes. But the "SQL" tab in the Spark UI shows 40 minutes.
After running the job in debug mode and going through the Spark log, I realised that the tasks writing to "_temporary" complete in 20 minutes.
After that, the merge of "_temporary" into the actual output path takes another 20 minutes.
So my question is: does the driver process merge the data from "_temporary" into the output path sequentially? Or is it done by executor tasks?
Is there anything I can do to improve the performance?
You may want to check the spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option in your app's config. With version 1, the driver commits the temporary files sequentially, which is known to create a bottleneck. But frankly, people usually observe this problem only with a much larger number of files than in your case. Depending on the version of Spark, you may be able to set the commit algorithm version to 2; see SPARK-20107 for details.
On a separate note, having 8 cores per executor is not recommended, as it might saturate disk I/O when all 8 tasks are writing output at once.

Deadlock when many spark jobs are concurrently scheduled

Using spark 2.4.4 running in YARN cluster mode with the spark FIFO scheduler.
I'm submitting multiple spark dataframe operations (i.e. writing data to S3) using a thread pool executor with a variable number of threads. This works fine if I have ~10 threads, but if I use hundreds of threads, there appears to be a deadlock, with no jobs being scheduled according to the Spark UI.
What factors control how many jobs can be scheduled concurrently? Driver resources (e.g. memory/cores)? Some other spark configuration settings?
EDIT:
Here's a brief synopsis of my code
ExecutorService pool = Executors.newFixedThreadPool(nThreads);
ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(pool);
Dataset<Row> aHugeDf = spark.read().json(hundredsOfPaths);
// Submit one filter-and-write job per item; each task returns null (Void).
List<Future<Void>> futures = listOfSeveralHundredThings
    .stream()
    .map(aThing -> ecs.submit(() -> {
        aHugeDf
            .filter(col("some_column").equalTo(aThing))
            .write()
            .format("org.apache.hudi")
            .options(writeOptions)
            .save(outputPathFor(aThing));
        return null;
    }))
    .collect(Collectors.toList());
// Wait up to 30 minutes per completed task. poll() throws InterruptedException,
// so a plain loop is used here instead of a lambda.
for (int i = 0; i < futures.size(); i++) {
    ecs.poll(30, TimeUnit.MINUTES);
}
pool.shutdownNow();
At some point, as nThreads increases, spark no longer seems to be scheduling any jobs as evidenced by:
ecs.poll(...) timing out eventually
The Spark UI jobs tab showing no active jobs
The Spark UI executors tab showing no active tasks for any executor
The Spark UI SQL tab showing nThreads running queries with no running job IDs
My execution environment is
AWS EMR 5.28.1
Spark 2.4.4
Master node = m5.4xlarge
Core nodes = 3x r5d.24xlarge
spark.driver.cores=24
spark.driver.memory=32g
spark.executor.memory=21g
spark.scheduler.mode=FIFO
If possible, write the output of the jobs to the HDFS of the AWS Elastic MapReduce cluster (to take advantage of the almost instantaneous renames and better file I/O of local HDFS) and add a distcp step to move the files to S3. This saves you the trouble of handling the innards of an object store trying to be a filesystem. Writing to local HDFS also lets you enable speculation to control runaway tasks without falling into the deadlock traps associated with DirectOutputCommitter.
If you must use S3 as the output directory, ensure that the following Spark configurations are set:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.speculation false
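As a hedged illustration of how these two properties might be passed at submit time, the sketch below turns them into spark-submit --conf arguments. The helper name to_submit_args and the script name my_job.py are hypothetical. The same keys could equally go into spark-defaults.conf.

```python
def to_submit_args(conf):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

# The two settings listed above for writing to S3.
s3_write_conf = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": 2,
    "spark.speculation": "false",
}

# my_job.py is a placeholder for the actual application.
print(" ".join(["spark-submit"] + to_submit_args(s3_write_conf) + ["my_job.py"]))
```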
Note: DirectParquetOutputCommitter was removed in Spark 2.0 due to the chance of data loss. Unfortunately, until S3A offers better consistency, we have to rely on workarounds. Things are improving with Hadoop 2.8.
Avoid key names in lexicographic order. One could use hashing/random prefixes or reverse date-time to get around this. The trick is to name your keys hierarchically, putting the most common things you filter by on the left side of your key. And never use underscores in bucket names, due to DNS issues.
Enabling fs.s3a.fast.upload uploads parts of a single file to Amazon S3 in parallel.
Refer to these articles for more detail:
Setting spark.speculation in Spark 2.1.0 while writing to s3
https://medium.com/#subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
IMO you're likely approaching this problem wrong. Unless you can guarantee that the number of tasks per job is very low, you're not likely to get much performance improvement by parallelizing hundreds of jobs at once. Your cluster can only support about 300 tasks at once; assuming you're using the default parallelism of 200, that's only 1.5 jobs. I'd suggest rewriting your code to cap the maximum number of concurrent queries at 10. I strongly suspect that you have 300 queries, each with only a single task out of several hundred actually running. Most OLAP data processing systems intentionally have a fairly low level of concurrent queries compared to more traditional relational database systems for this reason.
Also:
Apache Hudi has a default parallelism of several hundred, FYI.
Why don't you just partition based on your filter column?
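The suggestion to cap concurrency can be sketched without Spark at all. The Python example below stands in for the Java ExecutorService from the question: write_one is a hypothetical placeholder for the real filter-and-write job, the work items are made up, and the sleep only simulates a running Spark action. It also repeats the 300 / 200 arithmetic from the answer.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_CONCURRENT = 10  # the cap suggested in the answer

# Back-of-the-envelope figure from the answer: task slots / tasks per job.
jobs_that_fit = 300 / 200  # 1.5

_lock = threading.Lock()
_running = 0
peak = 0

def write_one(thing):
    """Placeholder for the real filter-and-write Spark job for one value."""
    global _running, peak
    with _lock:
        _running += 1
        peak = max(peak, _running)
    time.sleep(0.01)  # stands in for the Spark action
    with _lock:
        _running -= 1
    return thing

things = [f"thing-{i}" for i in range(100)]  # hypothetical work items

# The pool size, not the number of work items, bounds concurrent submissions.
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    futures = [pool.submit(write_one, t) for t in things]
    done = [f.result() for f in as_completed(futures)]

print(len(done), "jobs finished; peak concurrency", peak)
```

All 100 items are still processed, but never more than 10 at a time, so each running job gets a meaningful share of the cluster's task slots.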
I would start by eliminating possible causes. Are you sure it's Spark that is not able to submit many jobs? Is it Spark or is it YARN? If it is the latter, you might need to play with the YARN scheduler settings. Could it be something to do with the ExecutorService implementation, which may have some limitation at the scale you are trying to achieve? Could it be Hudi? With only the snippet, that's hard to determine.
How does the problem manifest itself other than no jobs starting up? Do you see any metrics or monitoring on the cluster, or any logs, that point to the problem as you describe it?
If it has to do with scaling, is it possible for you to autoscale with EMR flex and see if that works for you?
How many executor cores?
Looking into these might help you narrow down or perhaps confirm the issue, unless you have already looked into these things.
(I meant to add this as a comment rather than an answer, but the text is too long for a comment.)
Using threads or thread pools is always problematic and error-prone.
I had a similar problem processing Spark jobs in an Internet of Things application. I resolved it using fair scheduling.
Suggestions:
Use fair scheduling (fairscheduler.xml) instead of the YARN capacity scheduler.
How? Use dedicated resource pools, one per module. Once configured, the pools are visible in the Spark UI.
Check in the Spark UI that the unit of parallelism (the number of partitions) is correct for the DataFrames you use. This is Spark's native way of achieving parallelism.
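As a sketch of the "one pool per module" suggestion, the Python snippet below generates the text of a fairscheduler.xml with one FAIR pool per module. The pool names module_a and module_b are hypothetical, and the weight and minShare values are just placeholders.

```python
import xml.etree.ElementTree as ET

def build_fair_pools(pool_names, weight=1, min_share=0):
    """Build the text of a fairscheduler.xml with one FAIR pool per name."""
    root = ET.Element("allocations")
    for name in pool_names:
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = "FAIR"
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")

# Hypothetical module names, one pool each.
xml_text = build_fair_pools(["module_a", "module_b"])
print(xml_text)
```

To use such a file, the application would set spark.scheduler.mode=FAIR and point spark.scheduler.allocation.file at it. Each thread that submits jobs can then select its pool with sc.setLocalProperty("spark.scheduler.pool", "module_a") before triggering an action.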

Schedule each Apache Spark Stage to run on a specific Worker Node

Suppose, I am running a simple Wordcount application on Spark (actually Spark Streaming) with 2 worker nodes. By default each task (from any stage) is scheduled to any available resource based on a scheduling algorithm. However, I want to change the default scheduling to fix each stage to a specific worker node.
Here is what I am trying to achieve -
Worker Node 'A' should only process the first Stage (like 'map' stage). So all the data that comes in must first go to worker 'A'
and Worker Node 'B' should only process the second stage (like 'reduce' stage). Effectively, the results of Worker A are processed by Worker B.
My first question is - Is this sort of customisation possible on Spark or Spark Streaming by tuning the parameters or choosing a correct config option? (I don't think it is, but can someone confirm this?)
My second question is - Can I achieve this by making some change to the Spark scheduler code? I am ok hardcoding the IPs of the workers if necessary. Any hints or pointers on this specific problem, or even on understanding the Spark scheduler code in more detail, would be helpful.
I understand that this change defeats the efficiency goals of Spark to some extent but I am only looking to experiment with different setups for a project.
Thanks!

How does mllib code run on spark?

I am new to distributed computing, and I'm trying to run Kmeans on EC2 using Spark's mllib kmeans. As I was reading through the tutorial I found the following code snippet on
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
I am having trouble understanding how this code runs inside the cluster. Specifically, I'm having trouble understanding the following:
After submitting the code to the master node, how does Spark know how to parallelize the job? There seems to be no part of the code that deals with this.
Is the code copied to all nodes and executed on each node? Does the master node do any computation?
How do nodes communicate the partial results of each iteration? Is this dealt with inside the kmeans.train code, or does Spark core take care of it automatically?
Spark divides data into many partitions. For example, if you read a file from HDFS, the partitions should correspond to the partitioning of the data in HDFS. You can manually specify the number of partitions by calling repartition(numberOfPartitions). Each partition can be processed on a separate node, thread, etc. Sometimes data is partitioned by, e.g., a HashPartitioner, which looks at the hash of the data.
The number and size of partitions generally tell you whether the data is distributed/parallelized correctly. The creation of partitions is hidden in the RDD.getPartitions method.
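The idea behind hash partitioning can be shown with a toy example in plain Python. This is only an illustration of the technique, not Spark's code: Spark's HashPartitioner works on the key's hashCode, while this sketch uses zlib.crc32 as a stand-in so that results are stable between runs. The records and the partition count are made up.

```python
import zlib

def partition_for(key, num_partitions):
    """Map a key to a partition index via a non-negative hash modulo."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_records(records, num_partitions):
    """Group (key, value) records into num_partitions buckets by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions

# Hypothetical keyed records split into 3 partitions.
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
for index, part in enumerate(partition_records(records, 3)):
    print(index, part)
```

Records with the same key always land in the same partition, which is what lets each partition be processed independently on a separate node or thread.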
Resource scheduling depends on the cluster manager. We could write a very long post about that ;) I think that for this question, partitioning is the most important part. If not, please let me know and I will edit the answer.
Spark serializes the closures that are given as arguments to transformations and actions. Spark creates a DAG, which is sent to all executors, and the executors execute this DAG on the data, launching the closures on each partition.
Currently, after each iteration, data is returned to the driver and then the next job is scheduled. In the Drizzle project, AMPLab/RISELab is making it possible to create multiple jobs at one time, so that data won't be sent to the driver. It will create the DAG once and schedule, e.g., a job with 10 iterations. Shuffle between them will be limited or will not exist at all. Currently, the DAG is created in each iteration and the job is scheduled to the executors.
There is a very helpful presentation about resource scheduling in Spark and Spark Drizzle.
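The per-iteration pattern described above (partial results computed per partition, combined on the driver, then the next job scheduled) can be mimicked in plain Python. This is a toy 1-D k-means written for illustration, not MLlib's implementation. The partitions, the starting centers, and all function names are hypothetical.

```python
def closest(point, centers):
    """Index of the center nearest to a 1-D point."""
    return min(range(len(centers)), key=lambda i: abs(point - centers[i]))

def partial_sums(partition, centers):
    """'Executor' side: per-center [sum, count] for one partition."""
    sums = [[0.0, 0] for _ in centers]
    for point in partition:
        i = closest(point, centers)
        sums[i][0] += point
        sums[i][1] += 1
    return sums

def kmeans(partitions, centers, iterations=10):
    """'Driver' side: collect partial results once per iteration."""
    for _ in range(iterations):
        # One "job": every partition is processed independently.
        partials = [partial_sums(part, centers) for part in partitions]
        new_centers = []
        for i, old in enumerate(centers):
            total = sum(ps[i][0] for ps in partials)
            count = sum(ps[i][1] for ps in partials)
            new_centers.append(total / count if count else old)
        centers = new_centers  # updated centers feed the next "job"
    return centers

partitions = [[1.0, 2.0, 9.0], [10.0, 1.5, 11.0]]  # two hypothetical partitions
print(kmeans(partitions, centers=[0.0, 5.0]))  # [1.5, 10.0]
```

Only the small per-center sums and counts travel back to the driver in each round, while the points themselves stay in their partitions.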
