I am using spark 3.0 and I am setting parameters
My parameters:
spark.conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.conf.set("fs.s3a.fast.upload.buffer", "bytebuffer")
spark.conf.set("spark.sql.files.maxPartitionBytes",134217728)
spark.conf.set("spark.executor.instances", 4)
spark.conf.set("spark.executor.memory", 3)
Error:
pyspark.sql.utils.AnalysisException: Cannot modify the value of a Spark config: spark.executor.instances
I DONT want to pass it through spark-submit as this is pytest case that I am writing.
How do I get through this?
According to spark official documentation, the spark.executor.instances property may not be affected when setting programmatically through SparkConf in runtime, so it would be suggested to set through configuration file or spark-submit command line options.
Spark properties mainly can be divided into two kinds: one is related
to deploy, like “spark.driver.memory”, “spark.executor.instances”,
this kind of properties may not be affected when setting
programmatically through SparkConf in runtime, or the behavior is
depending on which cluster manager and deploy mode you choose, so it
would be suggested to set through configuration file or spark-submit
command line options; another is mainly related to Spark runtime
control, like “spark.task.maxFailures”, this kind of properties can be
set in either way.
You can try to add those option to PYSPARK_SUBMIT_ARGS before initialize SparkContext. Its syntax is similar to spark-submit.
Related
I'm trying to save a dataframe as a table and I'm wondering if there is a default path configuration I can set to make my life easier.
I understand that this works:
df.write.saveAsTable("mytable", path='s3a://mybucket/mybucketlocation')
but is it possible to have this command
df.write.saveAsTable("mytable")
achieve the same role with spark configurations?
Currently I have this configuration set, but it's not doing the trick.
('spark.sql.warehouse.dir', 's3a://mybucket/mybucketlocation')
This was already the object of discussion in previous post, however, I'm not convinced with the answers as the Google docs specify that it is possible to create a cluster setting the fs.defaultFS property. Moreover, even if possible to set this property programmatically, sometimes, it's more convenient to set it from command line.
So I wanted to know why the following option when passed to my cluster creation command does not work: --properties core:fs.defaultFS=gs://my-bucket? Please note I haven't included all parameters as I ran the command without the previous flag and it succeeded to create the cluster. However, when passing this, I get: "failed: Cannot start master: Insufficientnumber of DataNodes reporting."
If anyone managed to create a dataproc cluster by setting the fs.defaultFS that'd be great? Thanks.
It's true there are still known issues due to certain dependencies on actual HDFS; the docs were not intended to imply that setting fs.defaultFS to a GCS path at cluster-creation time would work, but to simply provide a convenient example of a property that appears in core-site.xml; in theory it would work to set fs.defaultFS to a different preexisting HDFS cluster, for example. I've filed a ticket to change the example in the documentation to avoid confusion.
Two options:
Just override fs.defaultFS at job-submission time using per-job properties
Workaround some of the known issues by setting fs.defaultFS explicitly using an initialization action instead of cluster properties.
Option 1 is better understood to work because cluster-level HDFS dependencies won't change. Option 2 works because most of the incompatibilities occur during initial startup only, and initialization actions run after the relevant daemons start up already. To override the setting in an init action, you'd use bdconfig:
bdconfig set_property \
--name 'fs.defaultFS' \
--value 'gs://my-bucket' \
--configuration_file /etc/hadoop/conf/core-site.xml \
--clobber
Followup from here.
I've added Custom Source and Sink in my application jar and found a way to get a static fixed metrics.properties on Stand-alone cluster nodes. When I want to launch my application, I give the static path - spark.metrics.conf="/fixed-path/to/metrics.properties". Despite my custom source/sink being in my code/fat-jar - I get ClassNotFoundException on CustomSink.
My fat-jar (with Custom Source/Sink code in it) is on hdfs with read access to all.
So here's what all I've already tried setting (since executors can't find Custom Source/Sink in my application fat-jar):
spark.executor.extraClassPath = hdfs://path/to/fat-jar
spark.executor.extraClassPath = fat-jar-name.jar
spark.executor.extraClassPath = ./fat-jar-name.jar
spark.executor.extraClassPath = ./
spark.executor.extraClassPath = /dir/on/cluster/* (although * is not at file level, there are more directories - I have no way of knowing random application-id or driver-id to give absolute name before launching the app)
It seems like this is how executors are getting initialized for this case (please correct me if I am wrong) -
Driver tells here's the jar location - hdfs://../fat-jar.jar and here are some properties like spark.executor.memory etc.
N number of Executors spin up (depending on configuration) on cluster
Start downloading hdfs://../fat-jar.jar but initialize metrics system in the mean time (? - not sure of this step)
Metrics system looking for Custom Sink/Source files - since it's mentioned in metrics.properties - even before it's done downloading fat-jar (which actually has all those files) (this is my hypothesis)
ClassNotFoundException - CustomSink not found!
Is my understanding correct? Moreover, is there anything else I can try? If anyone has experience with custom source/sinks, any help would be appreciated.
I stumbled upon the same ClassNotFoundException when I needed to extend existing GraphiteSink class and here's how I was able to solve it.
First, I created a CustomGraphiteSink class in org.apache.spark.metrics.sink package:
package org.apache.spark.metrics.sink;
public class CustomGraphiteSink extends GraphiteSink {}
Then I specified the class in metrics.properties
*.sink.graphite.class=org.apache.spark.metrics.sink.CustomGraphiteSink
And passed this file to spark-submit via:
--conf spark.metrics.conf=metrics.properties
In order to use custom source/sink, one has to distribute it using spark-submit --files and set it via spark.executor.extraClassPath
Basically, I want to check a property of Spark's configuration, such as "spark.local.dir" through command line, that is, without writing a program. Is there a method to do this?
There is no option of viewing the spark configuration properties from command line.
Instead you can check it in spark-default.conf file. Another option is to view from webUI.
The application web UI at http://driverIP:4040 lists Spark properties in the “Environment” tab. Only values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. For all other configuration properties, you can assume the default value is used.
For more details, you can refer Spark Configuration
Following command print your conf properties on console
sc.getConf.toDebugString
We can check in Spark shell using below command :
scala> spark.conf.get("spark.sql.shuffle.partitions")
res33: String = 200
Based on http://spark.apache.org/docs/latest/configuration.html. Spark provides three locations to configure the system:
Spark properties control most application parameters and can be set
by using a SparkConf object, or through Java system properties.
Environment variables can be used to set per-machine settings, such the IP address, through the conf/spark-env.sh script on each
node.
Logging can be configured through log4j.properties.
I haven't heard about method through command line.
Master command to check spark config from CLI
sc._conf.getAll()
Does anyone know how to set values in the SparkConf when starting the Spark History Server?
if you are using <SPARK_HOME>/sbin/start-history-server.sh then you cannot specify command line argument but you can specify SPARK_HISTORY_OPTS as environment variable and specify the various environment variables like: -
export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.ui.port=9000
but if you are using <SPARK_HOME>/sbin/start-daemon.sh script then you can specify multiple command line options. like this: -
<SPARK_HOME>/sbin/spark-daemon.sh start org.apache.spark.deploy.history.HistoryServer -Dspark.history.ui.port=9000
start-history-server.sh accepts --properties-file [propertiesFile] command-line option to specify the custom Spark properties using propertiesFile.
When not specified explicitly, Spark History Server uses the default configuration file, i.e. conf/spark-defaults.conf.