We have a Spark streaming job running in an HDInsight Spark cluster (YARN mode), and we are seeing the job stop after a few weeks because the nodes appear to run out of disk space due to log volume.
Is there a way to set a limit on log size for a Spark streaming job and enable rolling logs? I have tried setting the Spark executor log properties below in code, but the settings don't seem to be honored.
// Build the streaming configuration and set the executor log rolling limits.
val sparkConfiguration: SparkConf = EventHubsUtils.initializeSparkStreamingConfigurations
sparkConfiguration.set("spark.executor.logs.rolling.maxRetainedFiles", "2")
sparkConfiguration.set("spark.executor.logs.rolling.maxSize", "107374182")

val spark = SparkSession
  .builder
  .config(sparkConfiguration)
  .getOrCreate()
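For comparison, here is a minimal sketch of the same configuration with spark.executor.logs.rolling.strategy set explicitly; the assumption here is that rolling only becomes active once a strategy such as "size" is chosen, and whether these executor-side properties are honored may also depend on how the executors are launched (for example, they may need to be supplied at submit time rather than from application code).

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch: the rolling strategy is set explicitly alongside the size/retention limits
// (assumption: rolling is only active once a strategy is chosen).
val sparkConfiguration = new SparkConf()
  .set("spark.executor.logs.rolling.strategy", "size")        // roll logs by file size
  .set("spark.executor.logs.rolling.maxSize", "107374182")    // ~100 MB per file before rolling
  .set("spark.executor.logs.rolling.maxRetainedFiles", "2")   // keep only the two newest files

val spark = SparkSession
  .builder
  .config(sparkConfiguration)
  .getOrCreate()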
I have PySpark code running in Glue that fails with the error "Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues". It seems to be a common issue with Spark, and I have gone through a number of Stack Overflow posts and other forums. I tried tweaking several parameters in the Spark config, but nothing worked, hence posting here. I would appreciate any input. Below are a few details about my job. I have tried different values for the RPC and memory settings in the Spark config: executor memory of 1g, 20g, 30g, and 64g, and driver memory of 20g, 30g, and 64g. Please advise.
Glue version: 3.0
Spark version: 3.1
Spark Config:
from pyspark.sql import SparkSession

# Build the Glue job's Spark session with the RPC and memory settings under test.
spark = (SparkSession
    .builder
    .appName("jkTestApp")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .config("spark.rpc.message.maxSize", "512")
    .config("spark.driver.memory", "64g")
    .config("spark.executor.memory", "1g")
    .config("spark.executor.memoryOverhead", "512")
    .config("spark.rpc.numRetries", "10")
    .getOrCreate()
)
We are running a Spark 3.x cluster on Kubernetes with Kubeflow installed.
I am able to run Spark jobs without issue. While a job is running, the Spark UI shows information under "Jobs", "Stages", "Environment" and "SQL". However, the "Executors" and "Storage" tabs are blank.
The Spark job runs in client mode; the Spark driver and the Spark executors are in separate pods.
I have set the following configuration parameters for the Spark job, which completes successfully.
from pyspark.sql import SparkSession

# Event logging is enabled so the UI / history server can reconstruct the run;
# other Spark config options are omitted here.
spark = SparkSession.builder.appName("my_spark_app") \
    .config("spark.eventLog.enabled", "true") \
    .config("spark.eventLog.dir", "hdfs:///<hdfs-location>") \
    .config("spark.ui.prometheus.enabled", "true") \
    .getOrCreate()
Any suggestions on configuration parameters I may be missing, or Kubernetes pod setup issues that may prevent the "Storage" and "Executors" information from being displayed?
What is the purpose of SparkContext and SparkConf? I am looking for the detailed difference.
More than the definition below:
SparkContext was the entry point of any Spark application, used to access all Spark features; it needed a SparkConf, which held all the cluster configurations and parameters, to create a SparkContext object.
The first step of any Spark driver application is to create a SparkContext. The SparkContext allows your Spark driver application to access the cluster through a resource manager; the resource manager can be YARN or Spark's own cluster manager. To create a SparkContext you first create a SparkConf. The SparkConf stores the configuration parameters that your Spark driver application will pass to the SparkContext. Some of these parameters define properties of your driver application, and some are used by Spark to allocate resources on the cluster, such as the number of executors and the memory and cores they use on the worker nodes. setAppName() gives your Spark driver application a name so you can identify it in the Spark or YARN UI.
SparkConf is passed into SparkContext so our driver application knows how to access the cluster.
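As a small illustration (the app name, master URL and resource values below are hypothetical; on a real cluster the master would typically be "yarn" rather than a local master):

import org.apache.spark.{SparkConf, SparkContext}

// SparkConf holds the settings the driver application passes to SparkContext.
val conf = new SparkConf()
  .setAppName("MyDriverApp")               // name shown in the Spark / YARN UI (hypothetical)
  .setMaster("local[*]")                   // local master for illustration; "yarn" on a cluster
  .set("spark.executor.instances", "4")    // example resource settings
  .set("spark.executor.memory", "2g")
  .set("spark.executor.cores", "2")

// SparkContext is the entry point; it uses the SparkConf to reach the cluster.
val sc = new SparkContext(conf)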
Now that your Spark driver application has a SparkContext, it knows which resource manager to use and can ask it for resources on the cluster. If you are using YARN, Hadoop's ResourceManager (head node) and NodeManagers (worker nodes) work together to allocate containers for the executors. If the resources are available on the cluster, the executors are allocated memory and cores based on your configuration parameters. If you are using Spark's standalone cluster manager, the Spark master (head node) and Spark workers (worker nodes) are used to allocate the executors.
Each Spark driver application has its own executors on the cluster, which remain running as long as the driver application holds a SparkContext. The executors run user code, perform computations and can cache data for your application. The SparkContext creates a job that is broken into stages; the stages are broken into tasks, which the SparkContext schedules on the executors.
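Continuing the sketch above, a single action shows that breakdown: the action triggers a job, and its stage is split into one task per partition.

// reduce() triggers a job; this shuffle-free stage is split into tasks,
// one per partition, which the SparkContext schedules on the executors.
val total = sc.parallelize(1 to 1000, numSlices = 4)   // 4 partitions -> 4 tasks
  .map(_ * 2)
  .reduce(_ + _)

println(total)   // 1001000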
We are running Spark on Mesos in client mode.
We also have the Spark History Server.
The Spark event logs show up fine in the Spark History Server.
But how can we get the Spark executor logs from the Spark UI or the Spark History Server?
I can see how to configure a SparkConf when creating a streaming application (see here).
I assume that I can configure the SparkConf through the SnappyStreamingContext for a streaming job in the same way as for a streaming application. Let's say I get a handle to the SparkConf in a streaming job and modify some settings. Do these settings apply only to this streaming job, or is this a global configuration update for all jobs?
thanks!
Yes, you can configure the SparkConf through the SnappyStreamingContext for a streaming job, and it works the same way as Spark Streaming configuration. Since the SparkConf is a global configuration, it applies to all the jobs in a streaming application. Note that Spark doesn't allow you to change the SparkConf after the application has started.
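As a rough sketch with plain Spark Streaming (using StreamingContext in place of SnappyStreamingContext, since the same rule applies; the app name and the backpressure setting are just examples): the SparkConf must be final before the context is created, and later set() calls are not picked up by the running application.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// All SparkConf settings must be in place before the streaming context is created.
val conf = new SparkConf()
  .setAppName("StreamingConfExample")                      // hypothetical app name
  .setMaster("local[2]")                                   // local master for illustration
  .set("spark.streaming.backpressure.enabled", "true")

val ssc = new StreamingContext(conf, Seconds(10))          // 10-second batch interval

// From this point on the configuration is effectively frozen:
// conf.set(...) calls made here do not affect the running application.
println(ssc.sparkContext.getConf.get("spark.streaming.backpressure.enabled"))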