How to set SPARK_LOCAL_DIRS parameter using spark-env.sh file - apache-spark

I am trying to change the location Spark writes temporary files to. Everything I've found online says to do this by setting the SPARK_LOCAL_DIRS parameter in the spark-env.sh file, but I am not having any luck getting the change to actually take effect.
Here is what I've done:
Created a 2-worker test cluster using Amazon EC2 instances. I'm using Spark 2.2.0 and the R sparklyr package as a front end. The worker nodes are spun up using an auto-scaling group.
Created a directory to store temporary files in at /tmp/jaytest. There is one of these on each worker and one on the master.
PuTTYed into the Spark master machine and the two workers, navigated to /home/ubuntu/spark-2.2.0-bin-hadoop2.7/conf/spark-env.sh, and modified the file to contain this line: SPARK_LOCAL_DIRS="/tmp/jaytest"
Permissions for each of the spark-env.sh files are -rwxr-xr-x, and for the jaytest folders are drwxrwxr-x.
As far as I can tell this is in line with all the advice I've read online. However, when I load some data into the cluster it still ends up in /tmp, rather than /tmp/jaytest.
I have also tried setting the spark.local.dir parameter to the same directory, but again no luck.
Can someone please advise on what I might be missing here?
Edit: I'm running this as a standalone cluster (as the answer below indicates that the correct parameter to set depends on the cluster type).
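One quick way to see which setting actually reaches the executors is to ask them from a spark-shell on the master. The snippet below is a sketch, not part of the original question; also worth noting that spark-env.sh is only sourced when a worker process starts, so the workers need to be restarted after editing it.

// Sketch: check what the executor JVMs and the driver actually see.
// In standalone mode, SPARK_LOCAL_DIRS on the worker takes precedence over spark.local.dir.
val executorDirs = sc.parallelize(1 to 2, 2)
  .map(_ => sys.env.getOrElse("SPARK_LOCAL_DIRS", "<not set>"))
  .distinct()
  .collect()

println(s"executors see SPARK_LOCAL_DIRS = ${executorDirs.mkString(", ")}")
println("driver sees spark.local.dir = " + sc.getConf.getOption("spark.local.dir").getOrElse("<not set>"))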

As per the Spark documentation, if you have configured YARN as the cluster manager, it will override the spark-env.sh setting. Check the yarn-env or yarn-site file for the local dir setting.
"this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager."
source - https://spark.apache.org/docs/2.3.1/configuration.html
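For reference, the YARN-side setting this answer is pointing at is yarn.nodemanager.local-dirs in yarn-site.xml; YARN hands it to containers as LOCAL_DIRS, which takes precedence over SPARK_LOCAL_DIRS. A sketch of what the entry looks like (the value shown is just the directory from the question):

<!-- yarn-site.xml sketch: value is the directory from the question, use your own scratch dirs -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/tmp/jaytest</value>
</property>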

Mac env, spark-2.1.0, and spark-env.sh contains:
export SPARK_LOCAL_DIRS=/Users/kylin/Desktop/spark-tmp
Using spark-shell, it works.
Did you use the right format?

Related

Spark master/worker not writing logs within History Server configuration

Quite new to Spark setup. I want to persist event logs for each separate Spark cluster I run. In my setup, the /history-logs dir is mounted from different locations based on the cluster name. The directory's permissions allow read-write for the spark user (751).
The config file at $SPARK_HOME/conf/spark-defaults.conf for both master and workers is as follows:
spark.eventLog.enabled true
spark.eventLog.dir file:///history-logs
spark.history.fs.logDirectory file:///history-logs
I'm connecting from Zeppelin and running a simple piece of code:
val rdd = sc.parallelize(1 to 5)
println(rdd.sum())
No files are being written to the folder though.
If I configure similar parameters in the Zeppelin interpreter itself, I at least see that an application log file is created in the directory.
Is it possible to save the logs from the master/workers on a per-cluster basis? I might be missing something obvious.
Thanks!
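For what it's worth (a sketch, not from the original post): event logs are written by the application, i.e. the driver, not by the master or worker daemons, so what matters is the configuration of the process that creates the SparkContext, which here is the Zeppelin interpreter. The interpreter settings the poster mentions amount to roughly this in Scala; if this produces a log file while the spark-defaults.conf route does not, the defaults file is probably not being read by that process.

import org.apache.spark.sql.SparkSession

// Sketch only: the same event-log properties applied programmatically.
// The file:///history-logs path mirrors the question; it is not verified here.
val spark = SparkSession.builder()
  .appName("eventlog-check")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "file:///history-logs")
  .getOrCreate()

spark.sparkContext.parallelize(1 to 5).sum()  // run something so there is an event to log
spark.stop()  // the event log file is finalized when the application stops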

How to sync configuration between hadoop worker machines

We have a huge Hadoop cluster and we installed one Presto coordinator node and 850 Presto worker nodes. Now we want to change the values in the file config.properties, but this has to be done on all the workers!
So under
/opt/DBtasks/presto/presto-server-0.216/etc
the file looks like this:
[root@worker01 etc]# more config.properties
#
coordinator=false
http-server.http.port=8008
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://master01.sys76.com:8008
and we want to change it to
coordinator=false
http-server.http.port=8008
query.max-memory=500GB
query.max-memory-per-node=5GB
query.max-total-memory-per-node=20GB
discovery.uri=http://master01.sys76.com:8008
But this was done only on the first node, worker01, and we need to do it on all the workers. We could copy this file by scp to all the other workers, but not when root access is restricted. What I want to know is whether Presto already has a more elegant approach that syncs the configuration across all worker nodes, since as we all know, after setting new values we also need to restart the Presto launcher script.
Does Presto have a solution for this?
I should mention that root access on my cluster is restricted, so we can't copy the files via ssh.
Presto does not have the ability to sync the configurations. This is something you would need to manage outside of Presto, e.g. using a tool like Ansible. There is also the command-line tool presto-admin (https://github.com/prestosql/presto-admin) that can assist with deploying the configs across the cluster.
Additionally, if you are using a public cloud such as AWS, there are commercial solutions from Starburst (https://www.starburstdata.com/) that can assist with managing the configurations as well.
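For completeness, the presto-admin route looks roughly like this (a sketch based on presto-admin's documented command names; verify against your version, and note that presto-admin itself needs SSH access to the nodes, though it can run as a non-root sudo user):

# Sketch only -- command names from the presto-admin docs; check your version.
# After editing the coordinator/worker config templates that presto-admin keeps locally:
presto-admin configuration deploy    # push config.properties (and friends) to every node
presto-admin server restart          # restart the Presto launcher on all nodes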

Loading Spark Config for testing Spark Applications

I've been trying to test a Spark application on my local laptop before deploying it to a cluster (to avoid having to package and deploy my entire application every time), but I'm struggling with loading the Spark config file.
When I run my application on a cluster, I am usually providing a spark config file to the application (using spark-submit's --conf). This file has a lot of config options because this application interacts with Cassandra and HDFS. However, when I try to do the same on my local laptop, I'm not sure exactly how to load this config file. I know I can probably write a piece of code that takes the file path of the config file and just goes through and parses all the values and sets them in the config, but I'm just wondering if there are easier ways.
Current status:
I placed the desired config file in my SPARK_HOME/conf directory and called it spark-defaults.conf ---> this didn't get applied; however, this exact same file works fine when run with spark-submit.
For local mode, when I create the spark session, I'm setting Spark Master as "local[2]". I'm doing this when creating the spark session, so I'm wondering if it's possible to create this session with a specified config file.
Did you add the --properties-file flag with the spark-defaults.conf value in your IDE as an argument for the JVM?
In the official documentation (https://spark.apache.org/docs/latest/configuration.html) there are continual references to 'your default properties file'. Some options cannot be set inside your application, because the JVM has already started. And since the conf directory is only read by spark-submit, I suppose you have to explicitly load the configuration file when running locally.
This problem has been discussed here:
How to use spark-submit's --properties-file option to launch Spark application in IntelliJ IDEA?
Not sure if this will help anyone, but I ended up reading the conf file from a test resource directory and then setting all the values as system properties (copied this from the Spark source code):
// _sparkConfs is just a Map[String, String] populated from reading the conf file
for ((k, v) <- _sparkConfs) {
  System.setProperty(k, v)
}
This is essentially emulating the --properties-file option of spark-submit to a certain degree. By doing this, I was able to keep this logic in my test setup, and not need to modify the existing application code.
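For completeness, here is a sketch of the part elided above, i.e. one way _sparkConfs could be populated from a copy of the conf file kept under the test resources. The resource name is made up; java.util.Properties accepts the whitespace-separated "key value" lines that spark-defaults.conf uses.

import java.util.Properties
import scala.collection.JavaConverters._  // scala.jdk.CollectionConverters._ on Scala 2.13+

// Hypothetical resource name: a copy of the cluster conf file placed under src/test/resources.
val stream = getClass.getResourceAsStream("/spark-defaults.conf")
val props = new Properties()
props.load(stream)
stream.close()

val _sparkConfs: Map[String, String] = props.asScala.toMap

Setting these as system properties works because a newly created SparkConf picks up every JVM system property that starts with spark.; an alternative with the same effect is to pass the map to SparkConf.setAll before building the session.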

Why is spark .sparkStaging folder under hdfs when running spark on yarn in local machine?

I am trying to figure out why my Spark .sparkStaging folder defaults to being under the /user/name/ folder on my local HDFS. I never set the working directory for Spark at all. So why and how does this get set to HDFS? What configuration sets the default for that? I checked the Spark environment in the UI tab and the YARN config, and I can't see anything that sets it. Can someone give me a hint?
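Not an answer from the thread, but for context: on YARN, Spark uploads each application's jars and config to a .sparkStaging directory under the submitting user's HDFS home directory, which is where the /user/<name>/ path comes from. If you want it elsewhere, the spark.yarn.stagingDir property (Spark 2.0+) controls the base directory, e.g. in spark-defaults.conf:

# placeholder path -- pick a directory the submitting user can write to
spark.yarn.stagingDir  hdfs:///tmp/spark-staging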

How can I pass app-specific configuration to Spark workers?

I have a Spark app which uses many workers. I'd like to be able to pass simple configuration information to them easily (without having to recompile): e.g. USE_ALGO_A. If this was a local app, I'd just set the info in environment variables, and read them. I've tried doing something similar using spark-env.sh, but the variables don't seem to propagate properly.
How can I do simple runtime configuration of my code in the workers?
(PS I'm running a spark-ec2 type cluster)
You need to take care of configuring each worker.
From the Spark docs:
You can edit /root/spark/conf/spark-env.sh on each machine to set Spark configuration options, such as JVM options. This file needs to be copied to every machine to reflect the change.
If you use an Amazon EC2 cluster, there is a script that rsyncs a directory between the master and all workers.
The easiest way to do this is to use a script we provide called copy-dir. First edit your spark-env.sh file on the master, then run ~/spark-ec2/copy-dir /root/spark/conf to RSYNC it to all the workers.
see https://spark.apache.org/docs/latest/ec2-scripts.html
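As an aside (not part of the original answer): for a single app-specific flag like USE_ALGO_A, Spark's spark.executorEnv.* configuration can set environment variables on the executors without touching spark-env.sh on every machine. A minimal sketch, with USE_ALGO_A taken from the question and everything else illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// Master URL is assumed to come from spark-submit; only the flag is set here.
val conf = new SparkConf()
  .setAppName("algo-flag-demo")
  .set("spark.executorEnv.USE_ALGO_A", "true")  // becomes an env var in each executor JVM

val sc = new SparkContext(conf)

// Read the flag back inside tasks running on the workers.
val seen = sc.parallelize(1 to 4, 2)
  .map(_ => sys.env.getOrElse("USE_ALGO_A", "<not set>"))
  .distinct()
  .collect()

println(seen.mkString(", "))
sc.stop()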
