How can I pass app-specific configuration to Spark workers? - apache-spark

I have a Spark app which uses many workers. I'd like to be able to pass simple configuration information to them easily (without having to recompile): e.g. USE_ALGO_A. If this was a local app, I'd just set the info in environment variables, and read them. I've tried doing something similar using spark-env.sh, but the variables don't seem to propagate properly.
How can I do simple runtime configuration of my code in the workers?
(PS I'm running a spark-ec2 type cluster)

You need to take care of configuring each worker.
From the Spark docs:
You can edit /root/spark/conf/spark-env.sh on each machine to set Spark configuration options, such as JVM options. This file needs to be copied to every machine to reflect the change.
If you use an Amazon EC2 cluster, there is a script that RSYNC s a directory between teh master and all workers.
The easiest way to do this is to use a script we provide called copy-dir. First edit your spark-env.sh file on the master, then run ~/spark-ec2/copy-dir /root/spark/conf to RSYNC it to all the workers.
see https://spark.apache.org/docs/latest/ec2-scripts.html

Related

How to sync configuration between hadoop worker machines

We have huge hadoop cluster and we installed one coordinator preso node
and 850 presto workers nodes. now we want to change the values in the file - config.properties but this should be done on all the workers!
so under
/opt/DBtasks/presto/presto-server-0.216/etc
the file is like this
[root#worker01 etc]# more config.properties
#
coordinator=false
http-server.http.port=8008
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://master01.sys76.com:8008
and we want to change it to
coordinator=false
http-server.http.port=8008
query.max-memory=500GB
query.max-memory-per-node=5GB
query.max-total-memory-per-node=20GB
discovery.uri=http://master01.sys76.com:8008
but this was done only on the first node - worker01, but we need to do it also on all workers. well - we can copy this file by scp to all other workers , but not in case root is restricted but what I want to know , if presto already think about more elegant approach that sync the configuration on all workers node as all know after we set new values we need also to restart the presto louncer script
dose presto have solution to this ?
I must to tell that my cluster is restricted root , so we cant copy the files VIA ssh
Presto does not have the ability to sync the configurations. This is something you would need to manage outside e.g. using a tool like Ansible. There is also project command line tool presto-admin (https://github.com/prestosql/presto-admin) that can assist with deploying the configs across the cluster.
Additionally, if you are using public clouds such as AWS, there are commercial solutions from Starburst (https://www.starburstdata.com/) that can assist management of the configurations as well.

Loading Spark Config for testing Spark Applications

I've been trying to test a spark application on my local laptop before deploying it to a cluster (to avoid having to package and deploy my entire application every time) but struggling on loading the spark config file.
When I run my application on a cluster, I am usually providing a spark config file to the application (using spark-submit's --conf). This file has a lot of config options because this application interacts with Cassandra and HDFS. However, when I try to do the same on my local laptop, I'm not sure exactly how to load this config file. I know I can probably write a piece of code that takes the file path of the config file and just goes through and parses all the values and sets them in the config, but I'm just wondering if there are easier ways.
Current status:
I placed the desired config file in the my SPARK_HOME/conf directory and called it spark-defaults.conf ---> This didn't get applied, however this exact same file runs fine using spark-submit
For local mode, when I create the spark session, I'm setting Spark Master as "local[2]". I'm doing this when creating the spark session, so I'm wondering if it's possible to create this session with a specified config file.
Did you added --properties-file flag with spark-defaults.conf value in your IDE as an argument for JVM?
In official documentation (https://spark.apache.org/docs/latest/configuration.html) there is continuous reference to 'your default properties file'. Some options can not be set inside your application, because the JVM has already started. And since conf directory is read only through spark-submit, I suppose you have to explicitly load configuration file when running locally.
This problem has been discussed here:
How to use spark-submit's --properties-file option to launch Spark application in IntelliJ IDEA?
Not sure if this will help anyone, but I ended up reading the conf file from a test resource directory and then setting all the values as system properties (copied this from Spark Source Code):
//_sparkConfs is just a map of (String,String) populated from reading the conf file
for {
(k, v) ← _sparkConfs
} {
System.setProperty(k, v)
}
This is essentially emulating the --properties-file option of spark-submit to a certain degree. By doing this, I was able to keep this logic in my test setup, and not need to modify the existing application code.

How to set SPARK_LOCAL_DIRS parameter using spark-env.sh file

I am trying to change the location spark writes temporary files to. Everything I've found online says to set this by setting the SPARK_LOCAL_DIRS parameter in the spark-env.sh file, but I am not having any luck with the changes actually taking effect.
Here is what I've done:
Created a 2-worker test cluster using Amazon EC2 instances. I'm using spark 2.2.0 and the R sparklyr package as a front end. The worker nodes are spun up using an auto scaling group.
Created a directory to store temporary files in at /tmp/jaytest. There is one of these in each worker and one in the master.
Puttied into the spark master machine and the two workers, navigated to home/ubuntu/spark-2.2.0-bin-hadoop2.7/conf/spark-env.sh, and modified the file to contain this line: SPARK_LOCAL_DIRS="/tmp/jaytest"
Permissions for each of the spark-env.sh files are -rwxr-xr-x, and for the jaytest folders are drwxrwxr-x.
As far as I can tell this is in line with all the advice I've read online. However, when I load some data into the cluster it still ends up in /tmp, rather than /tmp/jaytest.
I have also tried setting the spark.local.dir parameter to the same directory, but also no luck.
Can someone please advise on what I might be missing here?
Edit: I'm running this as a standalone cluster (as the answer below indicates that the correct parameter to set depends on the cluster type).
As per the spark documentation it is clearly saying that if you have configured Yarn Cluster manager then it will be overwrite the spark-env.sh setting. Can you just check in Yarn-env or yarn-site file for the local dir folder setting.
"this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager."
source - https://spark.apache.org/docs/2.3.1/configuration.html
Mac env, spark-2.1.0, and spark-env.sh contains:
export SPARK_LOCAL_DIRS=/Users/kylin/Desktop/spark-tmp
Using spark-shell, it works.
Did you use the right format?

Best Practice For Deploying and Running Periodic Spark Job

I have a number of spark batch jobs each of which need to be run every x hours. I'm sure this must be a common problem but there seems to be relatively little on the internet as to what the best practice is here for setting this up. My current setup is as follows:
Build system (sbt) builds a tar.gz containing a fat jar + a script that will invoke spark-submit.
Once tests have passed, CI system (Jenkins) copies the tar.gz to hdfs.
I set up a chronos job to unpack the tar.gz to the local filesystem and run the script that submits to spark.
This setup works reasonably well, but there are some aspects of step 3) that I'm not fond of. Specifically:
I need a separate script (executed by chronos) that copies from hdfs, unpacks and runs the spark-submit task. As far as I can tell chrons can't run scripts from hdfs so I have to have a copy of this script on every mesos worker which makes deployment more complex that it would be if everything just lived on hdfs.
I have a feeling that I have too many moving parts. For example I was wondering if I could create an executable jar that could submit itself (args would be the spark master and the main class) in which case I would do away with at least one of the wrapper scripts. Unfortunately I haven't found a good way of doing this
As this is a problem that everyone faces I was wondering if anyone could give a better solution.
To download and extract archive you can use Mesos fetcher by specifying it in Chronos job config by setting uris field.
To do the same procedure on executors side you can set spark.executor.uri parameter in default Spark conf

Using Spark Shell (CLI) in standalone mode on distributed files

I am using Spark 1.3.1 in standalone mode (No YARN/HDFS involved - Only Spark) on a cluster with 3 machines. I have a dedicated node for master (no workers running on it) and 2 separate worker nodes.
The cluster starts healthy, and I am just trying to test my installation by running some simple examples via spark-shell (CLI - which I started on the master machine) : I simply put a file on the localfs on the master node (workers do NOT have a copy of this file) and I simply run:
$SPARKHOME/bin/spark-shell
...
scala> val f = sc.textFile("file:///PATH/TO/LOCAL/FILE/ON/MASTER/FS/file.txt")
scala> f.count()
and it returns the words count results correctly.
My Questions are:
1) This contradicts with what spark documentation (on using External Datasets) say as:
"If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system."
I am not using NFS and I did not copy the file to workers, so how does it work ? (Is it because spark-shell does NOT really launch jobs on the cluster, and does the computation locally (It is weird as I do NOT have a worker running on the node, I started shell on)
2) If I want to run SQL scripts (in standalone mode) against some large data files (which do not fit into one machine) through Spark's thrift server (like the way beeline or hiveserver2 is used in Hive) , do I need to put the files on NFS so each worker can see the whole file, or is it possible that I create chunks out of the files, and put each smaller chunk (which would fit on a single machine) on each worker, and then use multiple paths (comma separated) to pass them all to the submitted queries ?
The problem is that you are running the spark-shell locally. The default for running a spark-shell is as --master local[*], which will run your code locally on as many cores as you have. If you want to run against your workers, then you will need to run with the --master parameter specifying the master's entry point. If you want to see the possible options you can use with spark-shell, just type spark-shell --help
As to whether you need to put the file on each server, the short answer is yes. Something like HDFS will split it up across the nodes and the manager will handle the fetching as appropriate. I am not as familiar with NFS and if it has this capability, though

Resources