Loading Spark Config for testing Spark Applications - apache-spark

I've been trying to test a Spark application on my local laptop before deploying it to a cluster (to avoid having to package and deploy my entire application every time), but I'm struggling to load the Spark config file.
When I run my application on a cluster, I am usually providing a spark config file to the application (using spark-submit's --conf). This file has a lot of config options because this application interacts with Cassandra and HDFS. However, when I try to do the same on my local laptop, I'm not sure exactly how to load this config file. I know I can probably write a piece of code that takes the file path of the config file and just goes through and parses all the values and sets them in the config, but I'm just wondering if there are easier ways.
Current status:
I placed the desired config file in my SPARK_HOME/conf directory and named it spark-defaults.conf. It was not applied, even though this exact same file works fine with spark-submit.
For local mode, when I create the spark session, I'm setting Spark Master as "local[2]". I'm doing this when creating the spark session, so I'm wondering if it's possible to create this session with a specified config file.

Did you add the --properties-file flag, with spark-defaults.conf as its value, as a JVM argument in your IDE?
The official documentation (https://spark.apache.org/docs/latest/configuration.html) repeatedly refers to 'your default properties file'. Some options cannot be set inside your application, because the JVM has already started. And since the conf directory is read only through spark-submit, I suppose you have to load the configuration file explicitly when running locally.
This problem has been discussed here:
How to use spark-submit's --properties-file option to launch Spark application in IntelliJ IDEA?

Not sure if this will help anyone, but I ended up reading the conf file from a test resource directory and then setting all the values as system properties (copied this from Spark Source Code):
// _sparkConfs is a Map[String, String] populated by reading the conf file
for ((k, v) <- _sparkConfs) {
  System.setProperty(k, v)
}
This is essentially emulating the --properties-file option of spark-submit to a certain degree. By doing this, I was able to keep this logic in my test setup, and not need to modify the existing application code.
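For a concrete starting point, here is a minimal sketch of such a test setup, not the exact code used above. It assumes the file follows the spark-defaults.conf format, which java.util.Properties can parse (whitespace or = between key and value). The names TestSparkConf, parse and applyAsSystemProperties are made up for illustration.

```scala
import java.io.Reader
import java.util.Properties

object TestSparkConf {
  /** Parses spark-defaults.conf style text into an immutable Map. */
  def parse(reader: Reader): Map[String, String] = {
    val props = new Properties()
    props.load(reader)
    val names = props.stringPropertyNames().iterator()
    var conf = Map.empty[String, String]
    while (names.hasNext) {
      val key = names.next()
      conf += (key -> props.getProperty(key).trim)
    }
    conf
  }

  /** Copies only spark.* entries into JVM system properties. */
  def applyAsSystemProperties(conf: Map[String, String]): Unit =
    for ((k, v) <- conf if k.startsWith("spark.")) System.setProperty(k, v)
}
```

In a test's setup this would be called as TestSparkConf.applyAsSystemProperties(TestSparkConf.parse(new java.io.FileReader(path))) before the session is built, since a SparkConf created with defaults picks up spark.* system properties.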

Related

Spark master/worker not writing logs within History Server configuration

I'm quite new to Spark setup. I want to persist event logs for each separate Spark cluster I run. In my setup, the /history-logs directory is mounted from a different location depending on the cluster name. The directory's permissions (751) allow read-write for the spark user.
Config file under $SPARK_HOME/conf/spark-defaults.conf for both master and workers is as follows:
spark.eventLog.enabled true
spark.eventLog.dir file:///history-logs
spark.history.fs.logDirectory file:///history-logs
I'm connecting from Zeppelin and running a simple piece of code:
val rdd = sc.parallelize(1 to 5)
println(rdd.sum())
No files are being written to the folder though.
If I configure similar parameters in the Zeppelin interpreter itself, at least I see that an application log file is created in the directory.
Is it possible to save the logs from the master/workers on a per-cluster basis? I might be missing something obvious.
Thanks!

How to set SPARK_LOCAL_DIRS parameter using spark-env.sh file

I am trying to change the location spark writes temporary files to. Everything I've found online says to set this by setting the SPARK_LOCAL_DIRS parameter in the spark-env.sh file, but I am not having any luck with the changes actually taking effect.
Here is what I've done:
Created a 2-worker test cluster using Amazon EC2 instances. I'm using spark 2.2.0 and the R sparklyr package as a front end. The worker nodes are spun up using an auto scaling group.
Created a directory to store temporary files in at /tmp/jaytest. There is one of these in each worker and one in the master.
Puttied into the spark master machine and the two workers, navigated to home/ubuntu/spark-2.2.0-bin-hadoop2.7/conf/spark-env.sh, and modified the file to contain this line: SPARK_LOCAL_DIRS="/tmp/jaytest"
Permissions for each of the spark-env.sh files are -rwxr-xr-x, and for the jaytest folders are drwxrwxr-x.
As far as I can tell this is in line with all the advice I've read online. However, when I load some data into the cluster it still ends up in /tmp, rather than /tmp/jaytest.
I have also tried setting the spark.local.dir parameter to the same directory, but also no luck.
Can someone please advise on what I might be missing here?
Edit: I'm running this as a standalone cluster (as the answer below indicates that the correct parameter to set depends on the cluster type).
The Spark documentation states that if you use the YARN cluster manager, its setting overrides the one in spark-env.sh. Can you check the local dir setting in your yarn-env or yarn-site file?
"this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager."
source - https://spark.apache.org/docs/2.3.1/configuration.html
On a Mac with spark-2.1.0, my spark-env.sh contains:
export SPARK_LOCAL_DIRS=/Users/kylin/Desktop/spark-tmp
Using spark-shell, it works.
Did you use the right format?
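To make the precedence in the quoted documentation easier to reason about, here is a simplified sketch of it as a pure function. It is not Spark's actual implementation, which handles more cases. The onYarn flag and the object name are hypothetical, and "/tmp" is used as the documented default of spark.local.dir.

```scala
object LocalDirsSketch {
  /**
   * Simplified lookup order: the cluster manager's environment variable wins,
   * then spark.local.dir, then the documented default "/tmp".
   * `onYarn` is a hypothetical flag; Spark detects YARN on its own.
   */
  def effectiveLocalDirs(
      env: Map[String, String],
      conf: Map[String, String],
      onYarn: Boolean): List[String] = {
    val fromEnv =
      if (onYarn) env.get("LOCAL_DIRS")
      else env.get("SPARK_LOCAL_DIRS")
    fromEnv
      .orElse(conf.get("spark.local.dir"))
      .getOrElse("/tmp")
      .split(",")
      .map(_.trim)
      .filter(_.nonEmpty)
      .toList
  }
}
```

A quick related check is to evaluate sys.env.get("SPARK_LOCAL_DIRS") inside the running JVM (for example in spark-shell) to see whether the variable from spark-env.sh actually reached the process.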

Spark-submit Executors are not getting the properties

I am trying to deploy a Spark application to a 4-node DSE Spark cluster. I have built a fat jar with all dependent jars, and I have created a property file under src/main/resources that holds properties such as the batch interval, master URL, etc.
I have copied this fat jar to the master and I am submitting the application with "spark-submit"; my submit command is below.
dse spark-submit --class com.Processor.utils.jobLauncher --supervise application-1.0.0-develop-SNAPSHOT.jar qa
Everything works properly when I run on a single-node cluster, but when I run on the DSE Spark standalone cluster, the properties mentioned above, such as the batch interval, become unavailable to the executors. I googled and found that this is a common issue that many people have solved. I followed one of the solutions, built a fat jar, and tried to run it, but my properties are still unavailable to the executors.
Can someone please give any pointers on how to solve the issue?
I am using DSE 4.8.5 and Spark 1.4.2
and this is how I am loading the properties
System.setProperty("env",args(0))
val conf = com.typesafe.config.ConfigFactory.load(System.getProperty("env") + "_application")
Figured out the solution:
I was taking the property file name from a system property (set in the main method from the command-line parameter). When the code is shipped to and executed on a worker node, that system property is not available (obviously!), so instead of using Typesafe ConfigFactory to load the property file I now use simple Scala file reading.
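One alternative, which is not the fix described above, is to resolve every setting on the driver and hand the executors a small serializable value instead of a system property. A minimal sketch follows; the AppSettings class, the property keys and the defaults are all hypothetical. The round trip through Java serialization only illustrates that the values survive being shipped with a task.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

/** Hypothetical settings holder; case classes are Serializable, so closures can carry it. */
final case class AppSettings(env: String, batchIntervalSeconds: Int, masterUrl: String)

object AppSettings {
  /** Builds the settings on the driver from already-loaded key/value pairs. */
  def fromMap(env: String, props: Map[String, String]): AppSettings =
    AppSettings(
      env,
      props.getOrElse("batch.interval.seconds", "10").toInt, // hypothetical key and default
      props.getOrElse("master.url", "local[2]")              // hypothetical key and default
    )

  /** Serializes and deserializes the value, roughly what happens when a task is shipped. */
  def roundTrip(s: AppSettings): AppSettings = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(s)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    try in.readObject().asInstanceOf[AppSettings] finally in.close()
  }
}
```

A closure that references a local val of type AppSettings carries the resolved values with it, so nothing on the worker needs to look up the "env" system property.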

SparkContext.addFile vs spark-submit --files

I am using Spark 1.6.0. I want to pass some properties files, such as log4j.properties and some other custom properties files. I see that we can use --files, but I also saw that SparkContext has an addFile method. I would prefer to use --files rather than adding the files programmatically, assuming the two options are the same.
I did not find much documentation about --files, so are --files and SparkContext.addFile equivalent?
References I found about --files and for SparkContext.addFile.
It depends whether your Spark application is running in client or cluster mode.
In client mode the driver (application master) runs locally and can access those files from your project, because they are available on the local file system. SparkContext.addFile should find your local files and work as expected.
In cluster mode, the application is submitted via spark-submit. This means that your whole application is transferred to the Spark master or YARN, which starts the driver (application master) within the cluster on a specific node and in a separate environment. This environment has no access to your local project directory, so all necessary files have to be transferred as well. This can be achieved with the --files option. The same concept applies to jar files (dependencies of your Spark application): in cluster mode, they need to be added with the --jars option to be available on the classpath of the application master. If you use PySpark, there is a --py-files option.
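As an illustration of the programmatic route, here is a minimal sketch that ships a file with SparkContext.addFile and resolves it inside tasks with SparkFiles.get. It assumes spark-core is on the classpath and runs in local mode; the file name custom.properties and its content are made up for the demo.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object AddFileSketch {
  /** Ships a small properties file with addFile and reads its first line inside each task. */
  def firstLinesSeenByTasks(): Array[String] = {
    // Hypothetical file, created only for this demo.
    val dir = Files.createTempDirectory("addfile-demo")
    val file = dir.resolve("custom.properties")
    Files.write(file, "batch.interval=5\n".getBytes("UTF-8"))

    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("addFile-sketch"))
    try {
      sc.addFile(file.toString)
      sc.parallelize(1 to 2, 2).map { _ =>
        // SparkFiles.get resolves the local copy of a file added via addFile.
        val src = scala.io.Source.fromFile(SparkFiles.get("custom.properties"))
        try src.getLines().next() finally src.close()
      }.collect()
    } finally sc.stop()
  }
}
```

In cluster mode the path given to addFile must be reachable from wherever the driver runs, which is the situation where passing the file with --files at submit time is the safer choice.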

How can I pass app-specific configuration to Spark workers?

I have a Spark app which uses many workers. I'd like to be able to pass simple configuration information to them easily (without having to recompile): e.g. USE_ALGO_A. If this was a local app, I'd just set the info in environment variables, and read them. I've tried doing something similar using spark-env.sh, but the variables don't seem to propagate properly.
How can I do simple runtime configuration of my code in the workers?
(PS I'm running a spark-ec2 type cluster)
You need to take care of configuring each worker.
From the Spark docs:
You can edit /root/spark/conf/spark-env.sh on each machine to set Spark configuration options, such as JVM options. This file needs to be copied to every machine to reflect the change.
If you use an Amazon EC2 cluster, there is a script that rsyncs a directory between the master and all workers.
The easiest way to do this is to use a script we provide called copy-dir. First edit your spark-env.sh file on the master, then run ~/spark-ec2/copy-dir /root/spark/conf to RSYNC it to all the workers.
see https://spark.apache.org/docs/latest/ec2-scripts.html
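Once the variable is present on each worker, reading it could look like the sketch below. The object name WorkerFlags is hypothetical, and treating anything other than "true" as false is an assumption made for the example.

```scala
object WorkerFlags {
  /** Reads USE_ALGO_A from an environment map; absent or non-"true" values count as false. */
  def useAlgoA(env: Map[String, String] = sys.env): Boolean =
    env.get("USE_ALGO_A").map(_.trim.toLowerCase).contains("true")
}
```

The call has to happen inside the task (for example inside a map function), so that it is evaluated in the worker's JVM rather than captured once on the driver. The Spark configuration page also lists a spark.executorEnv.[EnvironmentVariableName] property for setting executor environment variables at submit time; whether it suits a spark-ec2 cluster is untested here.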

Resources