Option for specifying Spark environment API when using Spark Shell - apache-spark

Is there an option you can pass to the spark-shell that specifies which environment you will be running your code against? In other words, if I am using Spark 1.3, can I specify that I wish to use the Spark 1.2 API?
For example:
pyspark --api 1.2

spark-shell initializes org.apache.spark.repl.Main to start the REPL, which does not parse any command-line arguments. So no, it is not possible to pass an API version from the command line; you have to use the spark-shell binary that ships with the respective version of Spark.

Related

Cannot modify the value of a Spark config: spark.executor.instances

I am using Spark 3.0 and I am setting the following parameters:
spark.conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.conf.set("fs.s3a.fast.upload.buffer", "bytebuffer")
spark.conf.set("spark.sql.files.maxPartitionBytes",134217728)
spark.conf.set("spark.executor.instances", 4)
spark.conf.set("spark.executor.memory", 3)
Error:
pyspark.sql.utils.AnalysisException: Cannot modify the value of a Spark config: spark.executor.instances
I DON'T want to pass it through spark-submit, as this is a pytest case that I am writing.
How do I get through this?
According to the official Spark documentation, the spark.executor.instances property may not be affected when set programmatically through SparkConf at runtime, so it is suggested to set it through a configuration file or spark-submit command-line options.
Spark properties mainly can be divided into two kinds: one is related to deploy, like “spark.driver.memory”, “spark.executor.instances”, this kind of properties may not be affected when setting programmatically through SparkConf in runtime, or the behavior is depending on which cluster manager and deploy mode you choose, so it would be suggested to set through configuration file or spark-submit command line options; another is mainly related to Spark runtime control, like “spark.task.maxFailures”, this kind of properties can be set in either way.
You can try adding those options to PYSPARK_SUBMIT_ARGS before initializing the SparkContext. Its syntax is similar to that of spark-submit.
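For example, a minimal sketch of that approach in a pytest-style script (the app name and property values here are placeholders, not taken from your test):
import os
from pyspark.sql import SparkSession

# Deploy-time properties must be set before the SparkContext/JVM starts,
# e.g. via PYSPARK_SUBMIT_ARGS (same syntax as spark-submit options).
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--conf spark.executor.instances=4 --conf spark.executor.memory=3g pyspark-shell"
)

spark = SparkSession.builder.appName("pytest-spark").getOrCreate()

# Runtime-control properties can still be set afterwards:
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728)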

Passing typesafe config conf files to DataProcSparkOperator

I am using Google Dataproc to submit Spark jobs and Google Cloud Composer to schedule them. Unfortunately, I am facing difficulties.
I am relying on .conf files (typesafe config files) to pass arguments to my spark jobs.
I am using the following Python code for the Airflow Dataproc operator:
t3 = dataproc_operator.DataProcSparkOperator(
    task_id='execute_spark_job_cluster_test',
    dataproc_spark_jars='gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar',
    cluster_name='cluster',
    main_class='com.organ.ingestion.Main',
    project_id='project',
    dataproc_spark_properties={'spark.driver.extraJavaOptions': 'gs://file-dev/fileConf/development.conf'},
    scopes='https://www.googleapis.com/auth/cloud-platform',
    dag=dag)
But this is not working and I am getting some errors.
Could anyone help me with this?
Basically I want to be able to override the .conf files and pass them as arguments to my DataProcSparkOperator.
I also tried to do
arguments='gs://file-dev/fileConf/development.conf'
but this didn't take the .conf file mentioned in the arguments into account.
tl;dr You need to turn your development.conf file into a dictionary to pass to dataproc_spark_properties.
Full explanation:
There are two main ways to set properties -- on the cluster level and on the job level.
1) Job level
Looks like you are trying to set them on the job level: DataProcSparkOperator(dataproc_spark_properties={'foo': 'bar', 'foo2': 'bar2'}). That's the same as gcloud dataproc jobs submit spark --properties foo=bar,foo2=bar2 or spark-submit --conf foo=bar --conf foo2=bar2. Here is the documentation for per-job properties.
The value of spark.driver.extraJavaOptions should be the command-line arguments you would pass to java, for example -verbose:gc.
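For example, a sketch of the operator from the question with the .conf contents flattened into a properties dictionary instead of extraJavaOptions (the keys and values below are illustrative, not taken from your development.conf):
t3 = dataproc_operator.DataProcSparkOperator(
    task_id='execute_spark_job_cluster_test',
    dataproc_spark_jars='gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar',
    cluster_name='cluster',
    main_class='com.organ.ingestion.Main',
    project_id='project',
    # development.conf flattened into plain key/value Spark properties
    dataproc_spark_properties={
        'spark.executor.memory': '4g',
        'spark.executor.instances': '2',
    },
    dag=dag)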
2) Cluster level
You can also set properties on a cluster level using DataprocClusterCreateOperator(properties={'spark:foo': 'bar', 'spark:foo2': 'bar2'}), which is the same as gcloud dataproc clusters create --properties spark:foo=bar,spark:foo2=bar2 (documentation). Again, you need to use a dictionary.
Importantly, if you specify properties at the cluster level, you need to prefix them with which config file you want to add the property to. If you use spark:foo=bar, that means add foo=bar to /etc/spark/conf/spark-defaults.conf. There are similar prefixes for yarn-site.xml, etc.
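For example, a cluster-level sketch (the property values are illustrative, and other required cluster arguments such as num_workers and zone are omitted for brevity):
create_cluster = dataproc_operator.DataprocClusterCreateOperator(
    task_id='create_cluster',
    cluster_name='cluster',
    project_id='project',
    # the 'spark:' prefix writes these entries into /etc/spark/conf/spark-defaults.conf
    properties={
        'spark:spark.executor.memory': '4g',
        'spark:spark.executor.instances': '2',
    },
    dag=dag)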
3) Using your .conf file at the cluster level
If you don't want to turn your .conf file into a dictionary, you can also just append it to /etc/spark/conf/spark-defaults.conf using an initialization action when you create the cluster.
E.g. (this is untested):
#!/bin/bash
set -euxo pipefail
gsutil cp gs://path/to/my.conf .
cat my.conf >> /etc/spark/conf/spark-defaults.conf
Note that you want to append to rather than replace the existing config file, just so that you only override the configs you need to.

Printing spark command in yarn and cluster mode

I need to print some messages when running Spark in YARN mode. Obviously println(message) doesn't work... I want to find a way to collect the message. Can someone point me to the correct method, for example using collect?
How to use collect?
Does the below code work?
val c = message.collect()
println(c)
You can achieve this with the following call:
message.foreach(println)
You will find the output of the above call in the executor logs. Note that collect() as in your snippet also works, but it brings the data to the driver, so println(c) prints in the driver's output (in yarn-cluster mode that is the driver log, not your console).

How to check Spark configuration from command line?

Basically, I want to check a property of Spark's configuration, such as "spark.local.dir", through the command line, that is, without writing a program. Is there a method to do this?
There is no option for viewing the Spark configuration properties from the command line.
Instead, you can check them in the spark-defaults.conf file. Another option is to view them from the web UI.
The application web UI at http://driverIP:4040 lists Spark properties in the “Environment” tab. Only values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. For all other configuration properties, you can assume the default value is used.
For more details, you can refer to the Spark Configuration documentation.
The following command prints your configuration properties to the console:
sc.getConf.toDebugString
We can check it in the Spark shell using the command below:
scala> spark.conf.get("spark.sql.shuffle.partitions")
res33: String = 200
Based on http://spark.apache.org/docs/latest/configuration.html, Spark provides three locations to configure the system:
Spark properties control most application parameters and can be set by using a SparkConf object, or through Java system properties.
Environment variables can be used to set per-machine settings, such as the IP address, through the conf/spark-env.sh script on each node.
Logging can be configured through log4j.properties.
I haven't heard of a method to do it through the command line.
Master command to check the Spark config from the CLI (pyspark):
sc._conf.getAll()
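For example, to look up a single property such as spark.local.dir from the pyspark shell (a sketch; the fallback string is only printed if the property was never set explicitly):
# `sc` is the SparkContext provided by the pyspark shell
conf = dict(sc._conf.getAll())
print(conf.get("spark.local.dir", "<not set, Spark default applies>"))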

How to specify Spark properties when starting Spark History Server?

Does anyone know how to set values in the SparkConf when starting the Spark History Server?
If you are using <SPARK_HOME>/sbin/start-history-server.sh, then you cannot specify command-line arguments, but you can set the SPARK_HISTORY_OPTS environment variable with the various Spark properties, like this:
export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.ui.port=9000"
But if you are using the <SPARK_HOME>/sbin/spark-daemon.sh script, then you can specify multiple command-line options, like this:
<SPARK_HOME>/sbin/spark-daemon.sh start org.apache.spark.deploy.history.HistoryServer -Dspark.history.ui.port=9000
start-history-server.sh accepts the --properties-file [propertiesFile] command-line option to specify custom Spark properties using propertiesFile.
When not specified explicitly, Spark History Server uses the default configuration file, i.e. conf/spark-defaults.conf.
