How to run arbitrary code on spark executors - apache-spark

How to run any arbitrary code - which is not related to RDDs - on all of the executors.
lets say I want to save the date into a file on all drivers and executors periodically
Files.write(Paths.get("file.txt"), ZoneddateTime.now().toString().getBytes(StandardCharsets.UTF_8))
This will do it on the driver only.
edit: I know that I can send files to the executors at spark-submit time, but I want to create/update files runtime.

Related

Spark write to HDFS is slow

I have ORC data on HDFS (non partitioned), ~8billion rows, 250GB in size.
Iam reading the data in DF, writing the DF without ay transformations using partitionBy
ex:
df.write.mode("overwrite").partitionBy("some_column").orc("hdfs path")
As i monitored job status in spark UI - the job and stage is getting completed in 20minutes. But "SQL" tab in spark UI is showing 40minutes.
After running the job in debug mode and going through spark log, i realised the tasks writing to "_temporary" are getting completed in 20minutes.
After that, the merge of "_temporary" to the actual output path is taking 20minutes.
So my question is, is Driver process merging the data from "_temporary" to the output path sequntially? Or is it done by executor tasks?
Is there anything i can do to improve the performance?
You may want to check spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option in your app's config. With version 1, driver does commit temp. files sequentially, which has been known to create a bottleneck. But franky, people usually observe this problem only on a much larger number of files than in your case. Depending on the version of Spark, you may be able to set commit version to 2, see SPARK-20107 for details.
On a separate note, having 8 cores per executor is not recommended as it might saturate disk IO when all 8 tasks are writing output at once.

How does mllib code run on spark?

I am new to distributed computing, and I'm trying to run Kmeans on EC2 using Spark's mllib kmeans. As I was reading through the tutorial I found the following code snippet on
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
I am having trouble understanding how this code runs inside the cluster. Specifically, I'm having trouble understanding the following:
After submitting the code to master node, how does spark know how to parallelize the job? Because there seem to be no part of the code that deals with this.
Is the code copied to all nodes and executed on each node? Does the master node do computation?
How do node communitate the partial result of each iteration? Is this dealt inside the kmeans.train code, or is the spark core takes care of it automatically?
Spark divides data to many partitions. For example, if you read a file from HDFS, then partitions should be equal to partitioning of data in HDFS. You can manually specify number of partitions by doing repartition(numberOfPartitions). Each partition can be processed on separate node, thread, etc. Sometimes data are partitioned by i.e. HashPartitioner, which looks on hash of the data.
Number of partitions and size of partitions generally tells you if data is distributed/parallelized correctly. Creating partitions of data is hidden in RDD.getPartitions methods.
Resource scheduling depends on cluster manager. We can post very long post about them ;) I think that in this question, the partitioning is the most important. If not, please inform me, I will edit answer.
Spark serializes clusures, that are given as arguments to transformations and actions. Spark creates DAG, which is sent to all executors and executors execute this DAG on the data - it launches closures on each partition.
Currently after each iteration, data is returned to the driver and then next job is scheduled. In Drizzle project, AMPLab/RISELab is creating possibility to create multiple jobs on one time, so data won't be sent to the driver. It will create DAG one time and schedules i.e. job with 10 iterations. Shuffle between them will be limited / will not exists at all. Currently DAG is created in each iteration and job in scheduled to executors
There is very helpful presentation about resource scheduling in Spark and Spark Drizzle.

Spark Yarn running 1000 jobs in queue

I am trying to schedule 1000 jobs in Yarn cluster. I want to run more then 1000 jobs daily at same time and yarn to manage the resources. For 1000 files of different category from hdfs i am trying to create spark submit command from python and execute. But i am getting out of memory error due to spark submit using driver memory.
How can schedule 1000 jobs in spark yarn cluster? I even tried oozie job scheduling framework along with spark, it did not work as expected with HDP.
Actually, you might not need 1000 jobs to read from 1000 files in HDFS. You could try to load everything in a single RDD as well (the APIs do support reading multiple files and wildcards in paths). Now, after reading all the files in a single RDD, you should really focus on ensuring if you have enough memory, cores, etc. assigned to it and start looking at your business logic which avoids costly operations like shuffles, etc.
But, if you insist that you need to spawn 1000 jobs, one for each file, you should look at --executor-memory and --executor-cores (along with num-executors for parallelism). These give you leverage to optimise for memory/CPU footprint.
Also curious, you are saying that you get OOM during spark-submit (using driver memory). The driver doesn't really use any memory at all, unless you do things like collect or take with large set, which bring the data from the executors to the driver. Also you are firing the jobs in yarn-client mode? Another hunch is to check if the box where you spawn spark spark jobs has even enough memory just to spawn the jobs in the first place?
It will be easier if you could also paste some logs here.

Spark Executor Id in JAVA_OPTS

I was trying to profile some Spark jobs and I want to collect Java Flight Recorder(JFR) files from each executor. I am running my job on a YARN cluster with several nodes, so I cannot manually collect JRF file for each run. I want to write a script which can collect JFR file from each node in cluster for a given job.
MR provides a way to name JFR files generated by each task with taskId. It replaces '#task#' with TaskId in Java opts. With this I can get a unique name for JFR files created by each task and the since TaskId also has JobId, I can parse it to distinguish files generated by different MR jobs.
I am wondering, if Spark has something similar. Does Spark provides a way to determine executorId in Java opts? Has anyone else has tried to do something similar and found a better way collect all JFR files for a Spark job?
You can't set an executor id in the opts, but you can get the executor Id from the event log, as well as the slave node bearing it.
However I believe the option you give to spark-submit for a yarn master and a standalone one have the same effect on executors JVM, so you should be fine!
You can use {{EXECUTOR_ID}} and {{APP_ID}} placeholders in spark.executor.extraJavaOptions parameter. They will be replaced by Spark with executor's ID and application's ID, respectively.

Utilize multiple executors and workers in Spark job

I am running spark on a standalone mode with below spark-env configuration -
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=4g
With this I can see 4 workers on my spark UI 8080.
Now One thing is the number of executors on my master URL (4040) is just one, how can I increases this to say 2 per worker node.
Also when I am running a small code from spark its just making use of one executer, do I need to make any config change to ensure multiple executors on multiple workers are used.
Any help is appreciated.
Set spark.master parameter as local[k], where k is the number of threads you want to utilize. You'd better to write these parameters inside spark-submit command instead of using export.
Parallel processing is based on number of partions of RDD. If your Rdd has multiple partions then it will processed parallelly.
Do some Modification (repartion) in your code, it should work.

Resources