How to dynamically add dependencies to spark executors at runtime - apache-spark

I would like to add archive dependencies to my Spark executors in a way that works similarly to passing archive paths to spark-submit with the --archives option. However, I will not know what dependencies are required until runtime, so I need to do this programmatically after the Spark job has already been submitted.
Is there a way to do this? I'm currently working on a hacky solution where I download the required archives from within the function running on the executors, but this is much slower than having the driver download the archives once and then distribute them to the executors.

Assuming your resource manager is YARN, it is possible to set the property spark.yarn.dist.archives when creating the SparkSession:
spark = SparkSession.builder \
    .appName("myappname") \
    .config("spark.yarn.dist.archives", "file1.zip#file1,file2.zip#file2,...") \
    .getOrCreate()
More info here: https://spark.apache.org/docs/latest/running-on-yarn.html
You may find the properties spark.yarn.dist.files and spark.yarn.dist.jars useful too.
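If the archives are only known once the application is already running, a related option (a minimal sketch, not part of the answer above; file names and paths are hypothetical) is to distribute them from the driver at runtime with SparkContext.addFile and resolve them on each executor with SparkFiles.get:
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("runtime-deps").getOrCreate()

# The driver downloads/locates the archive once, then ships it to every executor
spark.sparkContext.addFile("/tmp/deps/model.zip")  # hypothetical local or remote path

def process(partition):
    # On each executor, resolve the local copy by its file name
    local_path = SparkFiles.get("model.zip")
    # ... unpack / load the dependency from local_path before using it ...
    for row in partition:
        yield row

spark.sparkContext.parallelize(range(8), 2).mapPartitions(process).collect()
On newer releases (Spark 3.1+, to the best of my knowledge) SparkContext.addArchive plays a similar role for archives specifically, unpacking them on the executors.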

Related

where do I create a spark configuration file and set spark.driver.memory to 2GB?

I am new to Spark and relatively new to Linux in general. I am running Spark on local Ubuntu in client mode, with 16 GB of RAM. I installed Apache Spark following this link, and I am able to run and process large volumes of data. The challenge is exporting the resulting data frames to CSV: with even 100k rows of data I get all sorts of memory issues. In contrast, I was able to process partitioned Python files totaling several million rows.
Based on lots of googling, I believe the problem lies with my spark.driver.memory. I need to change this, but since I am running in client mode I should change it in some configuration file. How can I find out whether I have an existing Spark configuration file, or how do I create a new one and set spark.driver.memory to 2 GB?
You can change the default value for all sessions in
$SPARK_HOME/conf/spark-defaults.conf
If you do not find spark-defaults.conf, there should be a spark-defaults.conf.template file; just cp spark-defaults.conf.template spark-defaults.conf and edit it, uncommenting the line:
# spark.driver.memory 5g
Alternatively, you can set the value just for the current session using .config in the session builder:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("myApp") \
    .config("spark.driver.memory", "5g") \
    .getOrCreate()
(perhaps you might also want to increase spark.executor.memory)
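As a quick sanity check (a sketch, not part of the original answer), you can read back which value actually took effect once the session is up:
# Prints the effective driver memory, or a fallback if the property was never set
print(spark.sparkContext.getConf().get("spark.driver.memory", "not set (Spark default)"))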
See also my other answer to a similar question.

Failing to enable Hive support in spark-submit (Spark 3)

I am using Spark in an integration test suite. It has to run locally and read/write files to the local file system. I also want to read/write this data as tables.
In the first step of the suite I write some Hive tables in the database feature_store, specifying
spark.sql.warehouse.dir=/opt/spark/work-dir/warehouse. The step completes correctly and I see the files in the folder I expect.
Afterwards I run a spark-submit step with (among others) these confs
--conf spark.sql.warehouse.dir=/opt/spark/work-dir/warehouse --conf spark.sql.catalogImplementation=hive
and when trying to read a table previously written I get
Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'feature_store' not found
However, if I try to do exactly the same thing with exactly the same configs in a spark-shell, I am able to read the data.
In the spark-submit job I use the following code to get the Spark session:
SparkSession spark = SparkSession.active();
I have also tried to use instead
SparkSession spark = SparkSession.builder().enableHiveSupport().getOrCreate();
but I keep getting the same problem as above.
I have understood that the problem is related to spark-submit not picking up Hive as the
catalog implementation. In fact, I see that spark.catalog is not an instance of HiveCatalogImpl during the spark-submit run (while it is when using spark-shell).
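A quick way to see which catalog implementation a running session actually resolved (a hedged diagnostic sketch in PySpark; the same conf lookup should work from the Java API):
# Returns "hive" when Hive support is active, "in-memory" otherwise
print(spark.conf.get("spark.sql.catalogImplementation"))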

What is the difference between submitting a Spark job via spark-submit and via hadoop directly?

I have noticed that in my project there are two ways of running Spark jobs.
The first way is to submit the job to the spark-submit script:
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100
The second way is to package the Java file into a jar and run it via hadoop, with the Spark code inside MainClassName:
hadoop jar JarFile.jar MainClassName
What is the difference between these two ways?
What prerequisites do I need in order to use either?
As you stated, the second way of running a Spark job, packaging a Java file with Spark classes and running it through hadoop jar, essentially wraps your Spark job inside a Hadoop job. This has its disadvantages: your job becomes directly dependent on the Java and Scala versions installed on your system/cluster, and you inherit the growing pains of keeping the two frameworks' versions compatible with each other. The developer therefore has to be careful about the setup the job will run on across two different platforms, even if this route can feel simpler to Hadoop users who are more comfortable with Java and the Map/Reduce/Driver layout than with Spark's more opinionated design and Scala's rather steep learning curve.
The first way, spark-submit, is the more "standard" approach (as far as one can judge from the majority of usage seen online, so take this with a grain of salt), and it runs the job almost entirely within Spark (except when you read your input from or store your output in HDFS, of course). With it you depend only on Spark, keeping Hadoop's quirks (i.e. its YARN resource management) away from your job, and it can be significantly faster in execution time, since it is the most direct approach.

Spark : Understanding Dynamic Allocation

I have launched a Spark job with the following configuration:
--master yarn --deploy-mode cluster --conf spark.scheduler.mode=FAIR --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.maxExecutors=19 --conf spark.dynamicAllocation.minExecutors=0
It works well and finishes successfully, but after checking the Spark history UI, this is what I saw:
My questions are (I'm interested in understanding more than in solutions):
Why does Spark request the last executor if it has no task to do?
How can I optimise the cluster resources requested by my job in dynamic allocation mode?
I'm using Spark 2.3.0 on YARN.
You need to respect two requirements to use Spark dynamic allocation:
spark.dynamicAllocation.enabled
spark.shuffle.service.enabled => the purpose of the external shuffle service is to allow executors to be removed without deleting the shuffle files they wrote.
Resources are adjusted dynamically based on the workload: the application gives resources back to the cluster when they are no longer used.
I am not sure that there is a fixed order; executors are simply requested in rounds, exponentially, i.e. an application adds 1 executor in the first round, then 2, 4, 8 and so on...
Configuring external shuffle service
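A minimal sketch of setting these properties when building the session (assuming YARN with the external shuffle service already running on the node managers; the executor bounds mirror the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("dynamic-allocation-demo") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.shuffle.service.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "0") \
    .config("spark.dynamicAllocation.maxExecutors", "19") \
    .getOrCreate()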
It's difficult to know what Spark did there without knowing the content of the job you submitted. Unfortunately the configuration string you provided does not say much about what Spark will actually perform upon job submission.
You will likely get a better understanding of what happened during a task by looking at the 'SQL' part of the history UI (right side of the top bar) as well as at the stdout logs.
Generally one of the better places to read about how Spark works is the official page: https://spark.apache.org/docs/latest/cluster-overview.html
Happy sparking ;)
It's because of the allocation policy:
Additionally, the number of executors requested in each round
increases exponentially from the previous round.
reference
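Purely as an illustration of that policy (a sketch; the real scheduler also reacts to pending tasks and timeouts), the cumulative number of executors requested ramps up like this, capped by spark.dynamicAllocation.maxExecutors=19 from the question:
max_executors = 19
requested, batch = 0, 1
while requested < max_executors:
    requested = min(requested + batch, max_executors)
    print(requested)  # 1, 3, 7, 15, 19
    batch *= 2        # each round asks for twice as many additional executors as the previous one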

spark streaming application and kafka log4j appender issue

I am testing my Spark Streaming application, and I have multiple functions in my code:
- some of them operate on a DStream[RDD[XXX]], some of them on RDD[XXX] (after I do DStream.foreachRDD).
I use the Kafka log4j appender to log business cases that occur within my functions, which operate on both the DStream and the RDD itself.
But data gets appended to Kafka only from the functions that operate on RDDs; it doesn't work when I want to append data to Kafka from my functions that operate on DStreams.
Does anyone know the reason for this behaviour?
I am working on a single virtual machine, where I have Spark & Kafka, and I submit applications using spark-submit.
EDITED
Actually I have figured out part of the problem. Data gets appended to Kafka only from the part of the code that is in my main function. All the code that is outside of my main doesn't write data to Kafka.
In main I declared the logger like this:
val kafkaLogger = org.apache.log4j.LogManager.getLogger("kafkaLogger")
While outside of my main, I had to declare it like:
@transient lazy val kafkaLogger = org.apache.log4j.LogManager.getLogger("kafkaLogger")
in order to avoid serialization issues.
The reason might lie in how the JVM serializes closures, or simply in the workers not seeing the log4j configuration file (even though my log4j file is in my source code, in the resources folder).
Edited 2
I have tried in many ways to ship the log4j file to the executors, but none of them is working. I tried:
sending the log4j file with the --files option of spark-submit
setting --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/home/vagrant/log4j.properties" in spark-submit
setting the log4j.properties file in --driver-class-path of spark-submit...
None of these options worked.
Does anyone have the solution? I do not see any errors in my error log.
Thank you
I think you are close.. First you want to make sure all the files are exported to the WORKING DIRECTORY (not the CLASSPATH) on all nodes using the --files flag. Then you want to reference these files in the extraClassPath option of the executors and the driver. I have attached the following command, hope it helps. The key is to understand that once the files are exported, they can be accessed on each node using just the file name relative to the working directory (and not a URL path).
Note: putting the log4j file in the resources folder will not work (at least when I tried, it didn't).
sudo -u hdfs spark-submit \
  --class "SampleAppMain" \
  --master yarn \
  --deploy-mode cluster \
  --verbose \
  --files file:///path/to/custom-log4j.properties,hdfs:///path/to/jar/kafka-log4j-appender-0.9.0.0.jar \
  --conf "spark.driver.extraClassPath=kafka-log4j-appender-0.9.0.0.jar" \
  --conf "spark.executor.extraClassPath=kafka-log4j-appender-0.9.0.0.jar" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  /path/to/your/jar/SampleApp-assembly-1.0.jar
