spark-submit: How to pass an argument to the jar mentioned in --jars? - apache-spark

I have a PySpark project in which, to log micro-batch metrics, we use a jar that is passed with --jars. It works, but I need to add an on/off switch for the logging. How can I pass this switch when running with spark-submit?
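One common approach (the configuration key, class name, and file names below are hypothetical, not from the original project) is to pass the switch as a Spark configuration property at submit time, e.g. spark-submit --jars metrics-logger.jar --conf spark.metricslogger.enabled=false your_app.py, and have the code inside the jar read it from the SparkConf. A minimal Scala sketch of the jar-side check:

// Sketch only; the key name "spark.metricslogger.enabled" is an assumption
import org.apache.spark.SparkConf

class MicrobatchMetricsLogger(conf: SparkConf) {
  // default to "enabled" when the switch is not supplied on the command line
  private val enabled: Boolean =
    conf.getBoolean("spark.metricslogger.enabled", true)

  def log(message: String): Unit =
    if (enabled) println(message) // replace with the real metrics sink
}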

Related

Spark on YARN via JDBC Thrift?

When executing queries via the Thrift interface, how do I tell it to run the queries over YARN?
I'm trying to get Spark's JDBC/ODBC Thrift interface to run Spark SQL calls on YARN. This combination seems to be absent from the documentation. The Spark on YARN docs give a bunch of options, but don't describe which configuration file to put them in so that the Thrift server will pick them up.
I see a few of the settings mentioned in spark-env.sh (cores, executor memory, etc.), but I can't figure out where to tell it to use YARN in the first place.
To make the Thrift server use YARN for execution, start it with the "--master yarn" parameter. This parameter can be appended to sbin/start-thriftserver.sh; from there it is passed through to the spark-submit script, which starts the server against that master.
There is no equivalent in a config file.
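For example, the full start command might look like this (the executor settings are illustrative only):

sbin/start-thriftserver.sh --master yarn --executor-memory 4g --num-executors 4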

Call multiple spark jobs within single EMR cluster

I want to run multiple Spark jobs using spark-submit within a single EMR cluster. Does EMR support this?
How can I achieve this?
At the moment I use AWS Lambda to invoke an EMR job for my Spark job, but we would like to extend this to multiple Spark jobs within a single EMR cluster.
You can run multiple Spark jobs on one EMR cluster sequentially - that is, the next job is launched after the previous job completes. This is done using EMR steps.
I used the Java SDK to run this, but you can see in this documentation how to add a step using only the CLI (a rough CLI sketch is also included after the code below).
My code below uses spark-submit, but it is not run directly as you would run it in the CLI. Instead I ran it as a shell script and included an environment variable for HADOOP_USER_NAME, so the Spark job runs under the username I specify. You can skip this if you want to run the job under the username you logged in to your EMR cluster with (hadoop, by default).
In the code excerpt below, the object emr is of type AmazonElasticMapReduce, provided by the SDK. If you're using the CLI approach you will not need it.
Some assisting methods like uploadConfFile are self-explanatory. I used an extensive configuration for the Spark application, and unlike the files and jars, which can be local or in S3/HDFS, the configuration file must be a local file on the EMR cluster itself.
When you finish, you will have created a step on your EMR cluster that will launch a new Spark application. You can specify many steps on your EMR cluster, and they will run one after the other.
//Upload the spark configuration you wish to use to a local file
uploadConfFile(clusterId, sparkConf, confFileName);
//create a list of arguments - which is the complete command for spark-submit
List<String> stepargs = new ArrayList<String>();
//start with an envelope to specify the hadoop user name
stepargs.add("/bin/sh");
stepargs.add("-c");
//call spark-submit; "$@" expands to the arguments that follow
stepargs.add("HADOOP_USER_NAME="+task.getUserName()+" spark-submit \"$@\"");
stepargs.add("sh");
//add the spark-submit arguments
stepargs.add("--class");
stepargs.add(mainClass);
stepargs.add("--deploy-mode");
stepargs.add("cluster");
stepargs.add("--master");
stepargs.add("yarn");
stepargs.add("--files");
//a comma-separated list of file paths in s3
stepargs.add(files);
stepargs.add("--jars");
//a comma-separated list of file paths in s3
stepargs.add(jars);
stepargs.add("--properties-file");
//the file we uploaded to the EMR, with its full path
stepargs.add(confFileName);
stepargs.add(jar);
//add the jar specific arguments in here
AddJobFlowStepsResult result = emr.addJobFlowSteps(new AddJobFlowStepsRequest()
.withJobFlowId(clusterId)
.withSteps(new StepConfig()
.withName(name)
.withActionOnFailure(ActionOnFailure.CONTINUE)
.withHadoopJarStep(new HadoopJarStepConfig()
.withJar("command-runner.jar")
.withArgs(stepargs))));
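For reference, a rough CLI equivalent of such a step (the cluster id, bucket paths, class name, and argument are placeholders) might look something like:

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=SparkStep,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[spark-submit,--class,com.example.Main,--deploy-mode,cluster,--master,yarn,s3://my-bucket/app.jar,arg1]'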

Possible to add extra jars to master/worker nodes AFTER spark submit at runtime?

I'm writing a service that runs inside a long-running Spark application launched via spark-submit. The service won't know which jars to put on the classpaths at the time of the initial spark-submit, so I can't include them using --jars. The service will then listen for requests that can include extra jars, which I then want to load onto my Spark nodes so work can be done using those jars.
My goal is to call spark-submit only once, at the very beginning, to launch my service. Then I'm trying to add jars from requests to the Spark session by creating a new SparkConf and building a new SparkSession out of it, something like
SparkConf conf = new SparkConf();
conf.set("spark.driver.extraClassPath", "someClassPath")
conf.set("spark.executor.extraClassPath", "someClassPath")
SparkSession.builder().config(conf).getOrCreate()
I tried this approach but it looks like the jars aren't getting loaded onto the executor classpaths as my jobs don't recognize the UDFs from the jars. I'm trying to run this in Spark client mode right now.
Is there a way to add these jars AFTER a spark-submit has been called and just update the existing Spark application, or is it only possible with another spark-submit that includes these jars using --jars?
Would using cluster mode vs client mode matter in this kind of situation?

User specific properties file in Spark (.hiverc equivalent)

We are trying to set some additional properties, like adding custom-built Spark listeners and adding jars to the driver and executor classpaths, for each Spark job that gets submitted.
We found the following approaches:
Change the spark-submit launcher script to add these extra properties
Edit spark-env.sh and add these properties to the "SPARK_SUBMIT_OPTS" and "SPARK_DIST_CLASSPATH" variables
Add a --properties-file option to the spark-submit launcher script
We would like to check whether this can be done per user, something like .hiverc in Hive, instead of at the cluster level. This would allow us to perform A/B testing of newly built features.
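One way to approximate a per-user .hiverc (the file name and property values below are illustrative only) is to keep a properties file in each user's home directory and point spark-submit at it with --properties-file:

# ~/.spark-user.conf (hypothetical per-user file)
spark.extraListeners           com.mycompany.CustomSparkListener
spark.driver.extraClassPath    /home/alice/libs/custom.jar
spark.executor.extraClassPath  /home/alice/libs/custom.jar
spark.jars                     /home/alice/libs/custom.jar

spark-submit --properties-file ~/.spark-user.conf --class com.example.Main app.jar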

Spark-submit Executers are not getting the properties

I am trying to deploy a Spark application to a 4-node DSE Spark cluster. I have created a fat jar with all dependent jars, and I have created a property file under src/main/resources which has properties like the batch interval, the master URL, etc.
I have copied this fat jar to the master and I am submitting the application with spark-submit; below is my submit command.
dse spark-submit --class com.Processor.utils.jobLauncher --supervise application-1.0.0-develop-SNAPSHOT.jar qa
Everything works properly when I run on a single-node cluster, but when I run on the DSE Spark standalone cluster, the properties mentioned above, like the batch interval, become unavailable to the executors. I googled and found that this is a common issue that many have solved, so I followed one of the solutions, created a fat jar, and tried to run it, but my properties are still unavailable to the executors.
Can someone please give any pointers on how to solve the issue?
I am using DSE 4.8.5 and Spark 1.4.2, and this is how I am loading the properties:
System.setProperty("env",args(0))
val conf = com.typesafe.config.ConfigFactory.load(System.getProperty("env") + "_application")
I figured out the solution:
I was deriving the property file name from a system property (which I set in the main method from a command-line parameter), and when the code gets shipped and executed on a worker node that system property is not available (obviously!). So instead of using Typesafe ConfigFactory to load the property file, I am using simple Scala file reading.
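A minimal Scala sketch of that approach, assuming the environment-specific properties file (e.g. qa_application.properties, name hypothetical) is shipped with --files so it lands in each node's working directory:

import java.io.FileInputStream
import java.util.Properties

object AppConfig {
  // Files passed with --files are placed in the container's working
  // directory, so a relative path is enough here.
  def load(env: String): Properties = {
    val props = new Properties()
    val in = new FileInputStream(s"${env}_application.properties")
    try props.load(in) finally in.close()
    props
  }
}

// usage: val props = AppConfig.load(args(0))
//        val batchInterval = props.getProperty("batchInterval")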
