Is there a spark configuration for the default path for the saveAsTable command? - apache-spark

I'm trying to save a dataframe as a table and I'm wondering if there is a default path configuration I can set to make my life easier.
I understand that this works:
df.write.saveAsTable("mytable", path='s3a://mybucket/mybucketlocation')
but is it possible to have this command
df.write.saveAsTable("mytable")
achieve the same result via Spark configuration settings?
Currently I have this configuration set, but it's not doing the trick.
('spark.sql.warehouse.dir', 's3a://mybucket/mybucketlocation')
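For reference, here is roughly how the session gets built (a sketch, not the exact code; the bucket path is the same placeholder as above, and spark.sql.warehouse.dir has to be set before the SparkSession is created):

from pyspark.sql import SparkSession

# Sketch: spark.sql.warehouse.dir must be in place before the SparkSession exists.
spark = (
    SparkSession.builder
    .config("spark.sql.warehouse.dir", "s3a://mybucket/mybucketlocation")
    .enableHiveSupport()  # assumption: a Hive metastore keeps the table definition
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a")], ["id", "value"])
# Expectation: the managed table data lands under the warehouse dir,
# e.g. s3a://mybucket/mybucketlocation/mytable for the default database.
df.write.saveAsTable("mytable")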

Related

Where is yarn.nodemanager.log-dirs in spark?

I have looked into:
log4j2.properties in /etc/spark/conf
yarn-site.xml
yarn-env.sh (YARN_LOG_DIR is not getting set there; in fact, while running a job there is no YARN_LOG_DIR environment variable in my executors)
log4j.properties in /etc/hadoop/conf
Where can I find and modify the yarn.nodemanager.log-dirs property?
To find this, we need to traverse some of Hadoop's source code:
yarn.nodemanager.log-dirs defaults to ${yarn.log.dir}/userlogs.
yarn.log.dir defaults to $HADOOP_LOG_DIR
$HADOOP_LOG_DIR defaults to ${HADOOP_HOME}/logs
So, have a look at $HADOOP_HOME/logs/userlogs to see whether you find something in there!
If you want to change it, you can do any of the following:
edit $HADOOP_HOME
edit $HADOOP_LOG_DIR
add -Dyarn.log.dir=<your_chosen_value> to your spark application
add -Dyarn.nodemanager.log-dirs=<your_chosen_value> to your spark application
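As a side note (not part of the original answer), a hedged way to check which value your Spark application actually sees is to read the Hadoop configuration that Spark loads; _jsc is an internal PySpark handle, and the property may come back empty or unresolved if it is only defined on the NodeManager side:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hadoop configuration as loaded by this Spark application.
# _jsc is an internal PySpark handle; in Scala this is sc.hadoopConfiguration.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

# May print None, or an unresolved ${yarn.log.dir}/userlogs, if the property
# is only set in the NodeManager's yarn-site.xml and not visible to the client.
print(hadoop_conf.get("yarn.nodemanager.log-dirs"))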

Cannot modify the value of a Spark config: spark.executor.instances

I am using Spark 3.0 and I am setting parameters programmatically.
My parameters:
spark.conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.conf.set("fs.s3a.fast.upload.buffer", "bytebuffer")
spark.conf.set("spark.sql.files.maxPartitionBytes",134217728)
spark.conf.set("spark.executor.instances", 4)
spark.conf.set("spark.executor.memory", 3)
Error:
pyspark.sql.utils.AnalysisException: Cannot modify the value of a Spark config: spark.executor.instances
I DON'T want to pass it through spark-submit, as this is a pytest case that I am writing.
How do I get around this?
According to the official Spark documentation, the spark.executor.instances property may not be affected when set programmatically through SparkConf at runtime, so it is suggested to set it through a configuration file or spark-submit command-line options.
Spark properties mainly can be divided into two kinds: one is related to deploy, like “spark.driver.memory”, “spark.executor.instances”, this kind of properties may not be affected when setting programmatically through SparkConf in runtime, or the behavior is depending on which cluster manager and deploy mode you choose, so it would be suggested to set through configuration file or spark-submit command line options; another is mainly related to Spark runtime control, like “spark.task.maxFailures”, this kind of properties can be set in either way.
You can try adding those options to PYSPARK_SUBMIT_ARGS before initializing the SparkContext. Its syntax is similar to spark-submit's.
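A minimal sketch of that approach for a pytest-style setup (names are illustrative; the deploy-time properties go into PYSPARK_SUBMIT_ARGS before the first SparkSession is created, and the string has to end with pyspark-shell):

import os
from pyspark.sql import SparkSession

# Deploy-related properties must be fixed before the JVM starts,
# so pass them through PYSPARK_SUBMIT_ARGS instead of spark.conf.set.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--conf spark.executor.instances=4 "
    "--conf spark.executor.memory=3g "
    "pyspark-shell"
)

# The app name is illustrative; these deploy properties only matter on a
# real cluster manager and are ignored in local mode.
spark = SparkSession.builder.appName("pytest-spark").getOrCreate()

# Runtime-control properties can still be set after startup.
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728)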

Hdfs file access in spark

I am developing an application where I read a file from Hadoop, process it, and store the data back to Hadoop.
I am confused about what the proper HDFS file path format should be. When reading an HDFS file from the Spark shell, like
val file=sc.textFile("hdfs:///datastore/events.txt")
it works fine and I am able to read it.
But when I submit a jar containing the same code to YARN, it gives the error
org.apache.hadoop.HadoopIllegalArgumentException: Uri without authority: hdfs:/datastore/events.txt
When I add the name node address, as in hdfs://namenodeserver/datastore/events.txt, everything works.
I am a bit confused about this behaviour and need some guidance.
Note: I am using an AWS EMR setup and all the configurations are default.
If you want to use sc.textFile("hdfs://..."), you need to give the full path including the authority; in your example that would be "hdfs://nn1home:8020/..".
If you want to make it simple, just use sc.textFile("hdfs:/input/war-and-peace.txt") with only one /.
I think it will work.
Problem solved. As I debugged further, the fs.defaultFS property from core-site.xml was not being used when I just passed the path as hdfs:///path/to/file, even though all the Hadoop config properties were loaded (I logged the sparkContext.hadoopConfiguration object).
As a workaround I manually read the property with sparkContext.hadoopConfiguration().get("fs.defaultFS") and prepended it to the path.
I don't know whether this is the correct way of doing it.
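A sketch of that workaround, translated to PySpark for illustration (the original snippets are Scala; _jsc is an internal handle, and the file path is the one from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Effective default filesystem from the Hadoop configuration Spark loaded.
# Scala equivalent: sparkContext.hadoopConfiguration.get("fs.defaultFS")
default_fs = spark.sparkContext._jsc.hadoopConfiguration().get("fs.defaultFS")

# Prepend it so the URI carries an explicit authority,
# e.g. hdfs://namenodeserver:8020/datastore/events.txt
path = default_fs.rstrip("/") + "/datastore/events.txt"
lines = spark.sparkContext.textFile(path)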

How to check Spark configuration from command line?

Basically, I want to check a property of Spark's configuration, such as "spark.local.dir" through command line, that is, without writing a program. Is there a method to do this?
There is no direct way to view the Spark configuration properties from the command line.
Instead you can check them in the spark-defaults.conf file. Another option is to view them from the web UI.
The application web UI at http://driverIP:4040 lists Spark properties in the “Environment” tab. Only values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. For all other configuration properties, you can assume the default value is used.
For more details, you can refer to the Spark Configuration documentation.
The following command prints your configuration properties to the console:
sc.getConf.toDebugString
We can check it in the Spark shell using the command below:
scala> spark.conf.get("spark.sql.shuffle.partitions")
res33: String = 200
Based on http://spark.apache.org/docs/latest/configuration.html, Spark provides three locations to configure the system:
Spark properties control most application parameters and can be set by using a SparkConf object, or through Java system properties.
Environment variables can be used to set per-machine settings, such as the IP address, through the conf/spark-env.sh script on each node.
Logging can be configured through log4j.properties.
I haven't heard of a method for doing this through the command line.
A catch-all command to check the Spark config from the CLI (PySpark shell):
sc._conf.getAll()
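For completeness, a hedged sketch of the same idea using the public getConf() API in a PySpark shell or script; only explicitly set (or propagated) properties show up, defaults do not:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext  # in the pyspark shell, sc and spark already exist

# getAll() returns (key, value) pairs for every explicitly set property.
for key, value in sorted(sc.getConf().getAll()):
    print(key, "=", value)

# A single property can be read with a fallback through the runtime SQL conf.
print(spark.conf.get("spark.local.dir", "<not set>"))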

How to specify Spark properties when starting Spark History Server?

Does anyone know how to set values in the SparkConf when starting the Spark History Server?
If you are using <SPARK_HOME>/sbin/start-history-server.sh then you cannot pass command-line arguments, but you can set SPARK_HISTORY_OPTS as an environment variable and supply the various properties as Java system options, for example:
export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.ui.port=9000"
But if you are using the <SPARK_HOME>/sbin/spark-daemon.sh script then you can specify multiple command-line options, like this:
<SPARK_HOME>/sbin/spark-daemon.sh start org.apache.spark.deploy.history.HistoryServer -Dspark.history.ui.port=9000
start-history-server.sh also accepts the --properties-file [propertiesFile] command-line option to specify custom Spark properties via propertiesFile.
When not specified explicitly, Spark History Server uses the default configuration file, i.e. conf/spark-defaults.conf.
