SnappyData - configuring streaming job spark settings - apache-spark

I can see how to configure a SparkConf when creating a streaming application (see here)
I assume that I can configure the SparkConf through the SnappyStreamingContext for a streaming job similar to a streaming application. Let's say I get a handle to the SparkConf in a streaming job and modify some settings. Do these settings only apply to this streaming job or is this a global configuration update for all jobs?
thanks!

Yes, you can configure the SparkConf through the SnappyStreamingContext for a streaming job, and it works the same way as Spark Streaming configuration. Because SparkConf is a global configuration, it applies to all the jobs in a streaming application, not only to the job that set it. Note that Spark does not support modifying the configuration at runtime: once a SparkConf has been passed to Spark it is cloned, so changes made to it after the application has started have no effect.
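A minimal sketch of that behaviour, assuming plain Spark Streaming on a local master rather than a SnappyData cluster (the object name, app name, and the chosen setting are illustrative only). It sets a property before the context is created, then shows that editing the SparkConf handle obtained afterwards does not change the running application:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical example object; requires Spark Streaming on the classpath.
object ConfBeforeStart {
  def main(args: Array[String]): Unit = {
    // Settings applied here are seen by every job in the application.
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("conf-before-start")
      .set("spark.streaming.backpressure.enabled", "true")

    val ssc = new StreamingContext(conf, Seconds(1))

    // getConf returns a copy, so this edit does not reach the live context.
    val copy = ssc.sparkContext.getConf
    copy.set("spark.streaming.backpressure.enabled", "false")

    // Still prints "true": the application keeps its original configuration.
    println(ssc.sparkContext.getConf.get("spark.streaming.backpressure.enabled"))

    ssc.stop(stopSparkContext = true)
  }
}
```

In other words, anything that must differ per job has to be decided before the shared context is created, or handled outside SparkConf.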

Related

Spark session/context lifecycle

I don't understand how the spark session/context lifecycle works. The documentation says that you can have multiple SparkSessions that share an underlying SparkContext. But how/when are those created and destroyed? For example, if I have a production cluster and I spark-submit 10 ETLs, will these 10 jobs share the same SparkContext? Does it matter if I do this in cluster/client mode?
To the best of my understanding, the SparkContext lives in the driver so I assume the above would result in one SparkContext shared by 10 SparkSessions, but I'm not at all sure I got this correctly... Any clarification will be much appreciated.
Let's first look at SparkSession and SparkContext.
SparkContext is the channel for accessing all Spark functionality. The Spark driver program uses it to connect to the cluster manager (for example YARN), to communicate with it, and to submit Spark jobs. Through SparkContext, the driver can also access other contexts such as SQLContext, HiveContext, and StreamingContext.
Since Spark 2.0, SparkSession provides access to all of the aforementioned functionality through a single, unified point of entry.
In other words, SparkSession encapsulates SparkContext.
Suppose multiple users access the same notebook, which has a shared SparkContext, and each user needs an isolated environment on top of that shared context. Prior to 2.0, the solution was to create multiple SparkContexts, one per isolated environment or user, which is an expensive operation (normally a single SparkContext exists per JVM). The introduction of SparkSession addresses this issue.
I spark-submit 10 ETLs, will these 10 jobs share the same
SparkContext? Does it matter if I do this in cluster/client mode? To
the best of my understanding, the SparkContext lives in the driver so
I assume the above would result in one SparkContext shared by 10
SparkSessions,
If you submit 10 ETL jobs with spark-submit, whether in cluster or client mode, they are all separate applications, each with its own SparkContext and SparkSession. In native Spark you can't share objects between different applications; if you want to share objects, you have to use shared contexts (spark-jobserver). Other options are available as well, such as Apache Livy and Apache Ignite.
There is one SparkContext per spark application.
The documentation says that you can have multiple SparkSessions that share an underlying SparkContext. But how/when are those created and destroyed?
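import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
// master and appName are assumed to be defined elsewhere in the application.
private val conf: SparkConf = new SparkConf()
  .setMaster(master)
  .setAppName(appName)
  .set("spark.ui.enabled", "false")
// Returns the existing session if there is one, otherwise creates it (and its SparkContext).
val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()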
If you have an existing Spark session and want to create a new one, use the newSession method on the existing SparkSession.
import org.apache.spark.sql.SparkSession
// spark is the existing SparkSession (named ss in the snippet above).
val newSparkSession1 = spark.newSession()
val newSparkSession2 = spark.newSession()
The newSession method creates a new Spark session with its own isolated SQL configuration and temporary tables. The new session shares the underlying SparkContext and cached data with the original one.
You can then use these different sessions to submit different jobs or SQL queries.
newSparkSession1.sql("<ETL-1>")
newSparkSession2.sql("<ETL-2>")
Does it matter if I do this in cluster/client mode?
Client/Cluster mode doesn't matter.
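To make the session isolation described above concrete, here is a small sketch, assuming a local master and Spark 2.1 or later (the object name, app name, and the view name etl_input are made up for illustration). It registers a temporary view in one session and checks whether a session created with newSession can see it:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; requires spark-sql on the classpath.
object SessionIsolation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("session-isolation")
      .getOrCreate()
    val other = spark.newSession()

    // Temporary views are scoped to the session that created them.
    spark.range(3).createOrReplaceTempView("etl_input")

    println(spark.catalog.tableExists("etl_input"))   // true: visible in the creating session
    println(other.catalog.tableExists("etl_input"))   // false: not visible in the new session
    println(spark.sparkContext eq other.sparkContext) // true: both share one SparkContext

    // Stopping the session stops the shared SparkContext for both sessions.
    spark.stop()
  }
}
```

Both sessions live inside one application and one driver, so this pattern does not apply across separate spark-submit invocations.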

Possible to add extra jars to master/worker nodes AFTER spark submit at runtime?

I'm writing a service that runs on a long-running Spark application from a spark submit. The service won't know what jars to put on the classpaths by the time of the initial spark submit, so I can't include it using --jars. This service will then listen for requests that can include extra jars, which I then want to load onto my spark nodes so work can be done using these jars.
My goal is to call spark submit only once, being at the very beginning to launch my service. Then I'm trying to add jars from requests to the spark session by creating a new SparkConf and building a new SparkSession out of it, something like
SparkConf conf = new SparkConf();
conf.set("spark.driver.extraClassPath", "someClassPath");
conf.set("spark.executor.extraClassPath", "someClassPath");
SparkSession.builder().config(conf).getOrCreate();
I tried this approach but it looks like the jars aren't getting loaded onto the executor classpaths as my jobs don't recognize the UDFs from the jars. I'm trying to run this in Spark client mode right now.
Is there a way to add these jars AFTER a spark-submit has been called and just update the existing Spark application, or is it only possible with another spark-submit that includes these jars using --jars?
Would using cluster mode vs client mode matter in this kind of situation?

Controlling log size in Spark streaming job

We have a Spark streaming job running in an HDInsight Spark cluster (YARN mode), and the job stops after a few weeks, apparently because the log volume exhausts the disk space.
Is there a way to set a limit on log size for a Spark streaming job and enable rolling logs? I have tried setting the Spark executor log properties below in code, but they don't seem to be honored.
val sparkConfiguration: SparkConf = EventHubsUtils.initializeSparkStreamingConfigurations
sparkConfiguration.set("spark.executor.logs.rolling.maxRetainedFiles", "2")
sparkConfiguration.set("spark.executor.logs.rolling.maxSize", "107374182")
val spark = SparkSession
  .builder
  .config(sparkConfiguration)
  .getOrCreate()

what to specify as spark master when running on amazon emr

EMR has native support for Spark. When using the EMR web interface to create a new cluster, it is possible to add a custom step that executes a Spark application when the cluster starts, basically an automated spark-submit after cluster startup.
How do I specify the master node in the SparkConf within the application when the EMR cluster is started and the jar file is submitted through the designated EMR step?
It is not possible to know the IP of the cluster master beforehand, as would be the case if I started the cluster manually and then used the information to build into my application before calling spark-submit.
Code snippet:
SparkConf conf = new SparkConf().setAppName("myApp").setMaster("spark://???:7077");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
Note that I am asking about the "cluster" execution mode, so the driver program runs on the cluster as well.
Short answer: don't.
Longer answer: A master URL like "spark://..." is for Spark Standalone, but EMR uses Spark on YARN, so the master URL should be just "yarn". This is already configured for you in spark-defaults.conf, so when you run spark-submit, you don't even have to include "--master ...".
However, since you are asking about cluster execution mode (actually, it's called "deploy mode"), you may specify either "--master yarn-cluster" (deprecated) or "--deploy-mode cluster" (preferred). This will make the Spark driver run on an arbitrary cluster node rather than on the EMR master.
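As a sketch of what "don't" looks like in application code (written in Scala here rather than the Java of the question; the object name is hypothetical), the application simply leaves the master unset and lets spark-submit supply it. On EMR that value would come from spark-defaults.conf, so this only runs when launched through spark-submit or with spark.master provided some other way:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical application entry point; requires spark-core on the classpath.
object MyApp {
  def main(args: Array[String]): Unit = {
    // No setMaster call: the master is taken from spark-submit / spark-defaults.conf.
    val conf = new SparkConf().setAppName("myApp")
    val sc = new SparkContext(conf)

    // Shows which master the launcher actually supplied.
    println(s"Running with master: ${sc.master}")

    sc.stop()
  }
}
```

Submitting it would then look something like spark-submit --deploy-mode cluster --class MyApp followed by the path to the application jar, with no --master argument needed on EMR.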

Checkpointing is not working in spark streaming

We put a data file in an HDFS path that is monitored by a Spark streaming application, and the streaming application sends the data to a Kafka topic. We stop the streaming application partway through and start it again, expecting it to resume from where it stopped. Instead, it processes the whole input data file again, so I guess checkpointing is not being used properly. We are using Spark 1.4.1.
How can we make the streaming application start from the point where it failed or stopped?
Thanks in advance.
When creating the context, use getOrCreate(checkpointDir, ..) so that previously checkpointed data is loaded if any exists.
For example: JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkpointDir,..)
See a working sample program at https://github.com/atulsm/Test_Projects/blob/master/src/spark/StreamingKafkaRecoverableDirectEvent.java
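The getOrCreate pattern mentioned above can be sketched in Scala roughly as follows. The directory paths, object name, and per-batch action are placeholders rather than anything from the question, and the sketch only shows how the context is built or recovered, not the Kafka output:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical example object; requires Spark Streaming on the classpath.
object RecoverableStream {
  // Placeholder paths, assumed for illustration only.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"
  val inputDir = "hdfs:///tmp/streaming-input"

  // The whole DStream graph must be defined inside this function,
  // because it is only called when no checkpoint exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("recoverable-stream")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    val lines = ssc.textFileStream(inputDir)
    lines.foreachRDD { rdd => println(s"records in batch: ${rdd.count()}") }
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Rebuilds the context from the checkpoint if present, otherwise creates a new one.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```

A common mistake is to create the StreamingContext with new and define the streams outside the creating function, in which case a restart ignores the checkpoint and starts from scratch.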
