Spark Executor Id in JAVA_OPTS - apache-spark

I am trying to profile some Spark jobs and I want to collect Java Flight Recorder (JFR) files from each executor. I am running my job on a YARN cluster with several nodes, so I cannot manually collect the JFR files for each run. I want to write a script which can collect the JFR files from each node in the cluster for a given job.
MR provides a way to name the JFR files generated by each task with the task ID: it replaces '#task#' with the TaskId in the Java opts. With this I can get a unique name for the JFR files created by each task, and since the TaskId also contains the JobId, I can parse it to distinguish files generated by different MR jobs.
I am wondering if Spark has something similar. Does Spark provide a way to determine the executor ID in the Java opts? Has anyone else tried to do something similar and found a better way to collect all the JFR files for a Spark job?

You can't set an executor ID in the opts, but you can get the executor ID from the event log, as well as the worker node hosting it.
However, I believe the options you give to spark-submit have the same effect on the executors' JVMs whether the master is YARN or standalone, so you should be fine!

You can use the {{EXECUTOR_ID}} and {{APP_ID}} placeholders in the spark.executor.extraJavaOptions parameter. Spark will replace them with the executor's ID and the application's ID, respectively.
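For example, a minimal sketch of a spark-submit invocation that produces one JFR file per executor; the JFR flags shown are the Oracle JDK 8 style flags, and the /tmp output path, class name, and jar name are placeholders chosen only for illustration:

spark-submit \
  --master yarn \
  --conf "spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:StartFlightRecording=filename=/tmp/{{APP_ID}}-{{EXECUTOR_ID}}.jfr" \
  --class com.example.MyJob \
  my-job.jar

Each executor then writes /tmp/<appId>-<executorId>.jfr on its own node, so a collection script can fetch the files from every node and group them by application ID.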

Related

How to run arbitrary code on spark executors

How to run any arbitrary code - which is not related to RDDs - on all of the executors.
Let's say I want to save the date into a file on the driver and all executors periodically:
Files.write(Paths.get("file.txt"), ZonedDateTime.now().toString().getBytes(StandardCharsets.UTF_8))
This will do it on the driver only.
Edit: I know that I can send files to the executors at spark-submit time, but I want to create/update files at runtime.
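One commonly suggested workaround is sketched below: run a throwaway job with more partitions than executors and perform the side effect inside foreachPartition. This is not guaranteed to touch every executor, and the partition count and file path are illustrative assumptions:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import java.time.ZonedDateTime
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("touch-all-executors"))

// Dummy job with far more partitions than executors, so that with high
// probability every executor runs at least one partition.
sc.parallelize(0 until 1000, 1000).foreachPartition { _ =>
  // This closure runs on the executors, not on the driver.
  Files.write(Paths.get("file.txt"),
    ZonedDateTime.now().toString.getBytes(StandardCharsets.UTF_8))
}

// foreachPartition does not run on the driver, so repeat the write there.
Files.write(Paths.get("file.txt"),
  ZonedDateTime.now().toString.getBytes(StandardCharsets.UTF_8))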

Is there any way to run Spark scripts and store their outputs in parallel with Oozie?

I have three Spark scripts, and each of them runs one Spark SQL query to read a partitioned table and store the result to an HDFS location. Each script has a different SQL statement and a different target folder to store the data into.
test1.py - Read from table 1 and store to location 1.
test2.py - Read from table 2 and store to location 2.
test3.py - Read from table 3 and store to location 3.
I run these scripts using a fork action in Oozie and all three run, but the problem is that the scripts are not storing data in parallel.
Once the store from one script is done, the next store starts.
My expectation is to store all three tables' data into their respective locations in parallel.
I have tried FAIR scheduling and other scheduler techniques in the Spark scripts but those don't work. Can anyone please help? I have been stuck on this for the last 2 days.
I am using AWS EMR 5.15, Spark 2.4 and Oozie 5.0.0.
For the Capacity Scheduler
If you are submitting the jobs to a single queue, whichever job comes first in the queue gets the resources; intra-queue preemption won't work.
There is a related Jira for intra-queue preemption in the Capacity Scheduler: https://issues.apache.org/jira/browse/YARN-10073
You can read more at https://blog.cloudera.com/yarn-capacity-scheduler/
For the Fair Scheduler
Setting the "yarn.scheduler.fair.preemption" parameter to "true" in yarn-site.xml enables preemption at the cluster level. By default this is false, i.e. no preemption (see the snippet after this list).
Your problem could be:
1. One job is taking up the maximum resources. To verify this, please check the YARN UI and the Spark UI.
2. You have more than one YARN queue (other than default). Try setting User Limit Factor > 1 for the queue you are using.
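A minimal sketch of the yarn-site.xml property described above; the rest of the Fair Scheduler configuration (queues, allocation file) is assumed to already exist:

<property>
  <name>yarn.scheduler.fair.preemption</name>
  <value>true</value>
</property>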

Get spark variables in oozie spark action

I am new to Spark and Oozie.
I am trying to get a few variables from Spark and use them in the next Oozie action.
In the "Decision" node, spark-submit is called, some processing is done, and a counter variable is generated.
E.g.: var counter = 8 from Spark.
Now I need to use this variable in the next Oozie action, which is the "take decision" node:
[Decision] --counter--> [take decision]
When I googled, I was able to find a few solutions:
1. Write to HDFS
2. Wrap spark-submit in a shell action and use <capture-output>
(I am not able to use this as I use the Oozie spark action node.)
Are there any other ways to do the same?
The best approach is to store the values in HDFS (or Hive) or in HBase/Cassandra, and have your decision action read those values.
If you wrap spark-submit in a shell action, there will be a problem if you submit the job in cluster mode, because spark-submit hands the job to the YARN cluster and the driver can run on any node, so you cannot capture its output.
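As an illustration of the HDFS option, here is a minimal sketch of the Spark side writing the counter to an agreed-upon location; the path /workflow/decision/counter.txt is a hypothetical choice, and the downstream workflow step would read that same path:

import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("decision-step"))

// ... existing processing that produces the counter ...
val counter = 8  // example value from the question

// Write the counter to a well-known HDFS location; the path below is
// hypothetical and used only for illustration.
val fs = FileSystem.get(sc.hadoopConfiguration)
val out = fs.create(new Path("/workflow/decision/counter.txt"), true)
try out.write(counter.toString.getBytes(StandardCharsets.UTF_8))
finally out.close()

sc.stop()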

Spark Master filling temporary directory

I have a simple Spark app that reads some data, computes some metrics, and then saves the result (input and output are Cassandra tables). This piece of code runs at regular intervals (i.e., every minute).
I have a Cassandra/Spark cluster (Spark 1.6.1), and after a few minutes the temporary directory on the master node of the Spark cluster fills up and the master refuses to run any more jobs. I am submitting the job with spark-submit.
What is it that I am missing? How do I make sure that the master node removes the temporary folder?
Spark uses this directory as its scratch space and writes temporary map output files there. This location can be changed; take a look at spark.local.dir.
Every time you submit your app, the jar is copied to all the workers into a new app directory. How big is your jar? Are you building a fat jar that includes the DataStax driver? In that case your app is probably a few MB, and running it every minute will fill up your disk very quickly.
Spark has two parameters to control the cleaning of the app directories:
spark.worker.cleanup.interval, which controls how often Spark cleans up, and
spark.worker.cleanup.appDataTtl, which controls how long an app directory may stay before being cleaned.
Both parameters are in seconds.
Hope this helps!
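Note that these settings only take effect in standalone mode when spark.worker.cleanup.enabled is set to true (it defaults to false). A minimal sketch for spark-env.sh on the workers, using the default values of 1800 seconds for the interval and 604800 seconds (7 days) for the TTL:

export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"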

Utilize multiple executors and workers in Spark job

I am running Spark in standalone mode with the spark-env configuration below:
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=4g
With this I can see 4 workers on the Spark UI (port 8080).
However, the number of executors shown on the UI at port 4040 is just one. How can I increase this to, say, 2 per worker node?
Also, when I run a small piece of code, Spark just makes use of one executor. Do I need to make any config change to ensure that multiple executors on multiple workers are used?
Any help is appreciated.
Set the spark.master parameter to local[k], where k is the number of threads you want to utilize. It is better to pass these parameters on the spark-submit command line instead of using export.
Parallel processing is based on the number of partitions of the RDD. If your RDD has multiple partitions, it will be processed in parallel.
Modify your code to repartition the RDD and it should work; see the sketch below.
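A minimal sketch of repartitioning, assuming a hypothetical input path and an arbitrary partition count of 8 chosen only for illustration:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("repartition-example"))

// hdfs:///data/input is a hypothetical path used only for illustration.
val rdd = sc.textFile("hdfs:///data/input")

// Repartition so the work can be spread across the available executors;
// 8 partitions is an arbitrary example count.
val repartitioned = rdd.repartition(8)

println(repartitioned.map(_.length).reduce(_ + _))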
