Can spark-defaults.conf resolve environment variables? - apache-spark

If I have a line like below in my spark-env.sh file
export MY_JARS=$(jars=(/my/lib/dir/*.jar); IFS=,; echo "${jars[*]}")
which gives me a comma delimited list of jars in /my/lib/dir, is there a way I can specify
spark.jars $MY_JARS
in the spark-defaults.conf?

tl;dr No, it cannot, but there is a solution.
Spark reads the conf file as a properties file without any additional env var substitution.
What you could do, however, is write the computed value of MY_JARS from spark-env.sh straight into spark-defaults.conf using >> (append). The last entry for a key wins, so duplicate entries left behind by repeated appends are not a problem.
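A minimal sketch of that approach, added to spark-env.sh (it assumes SPARK_HOME is set and that the conf directory is the usual $SPARK_HOME/conf; adjust to your layout):
# spark-env.sh (sketch): compute the jar list and append it to spark-defaults.conf.
export MY_JARS=$(jars=(/my/lib/dir/*.jar); IFS=,; echo "${jars[*]}")
# Repeated appends leave multiple spark.jars lines; the last one wins.
echo "spark.jars $MY_JARS" >> "$SPARK_HOME/conf/spark-defaults.conf"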

I tried this with Spark 1.4 and it did not work.
spark-defaults.conf is a plain key/value properties file, and looking at the code it seems the values are not evaluated.

At least in Spark 3+, there is a way to do this: ${env:VAR_NAME}.
For instance, if you want to add the current username to the Spark metrics namespace, add this to your spark-defaults.conf file:
spark.metrics.namespace=${env:USER}
The generated metrics will show the username instead of the default namespace:
testuser.driver.BlockManager.disk.diskSpaceUsed_MB.csv
testuser.driver.BlockManager.memory.maxMem_MB.csv
testuser.driver.BlockManager.memory.maxOffHeapMem_MB.csv
testuser.driver.BlockManager.memory.maxOnHeapMem_MB.csv
... etc ...
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/internal/VariableSubstitution.html
A helper class that enables substitution using syntax like ${var}, ${system:var} and ${env:var}.
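As a quick way to see the same substitution mechanism in action from code, here is a small sketch (assuming a SparkSession named spark; spark.sql.variable.substitute is enabled by default) that resolves an environment variable inside a SQL statement:
// Variable substitution in SQL text, performed by VariableSubstitution
// when spark.sql.variable.substitute is true (the default).
spark.sql("SELECT '${env:USER}' AS user_name").show(false)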

Related

How is the option --define used in arangoimport?

From the documentation, it is not clear how I can use this option.
Is it for telling arangoimport: "Hey, please use this field as the _from/_to field when you import"?
--define string…   Define key=value for a #key# entry in config file
This has nothing to do with data import. arangod, arangosh, etc. all support --define to set environment variables, which can be referenced in configuration files via placeholders like #FOO# and set on the command line like --define FOO=something.
This is briefly explained here: https://www.arangodb.com/docs/stable/administration-configuration.html#environment-variables-as-parameters
Example configuration file example.conf:
[server]
endpoint = tcp://127.0.0.1:#PORT#
Example invocation:
arangosh --config example.conf --define PORT=8529
For delimited source files (CSV, TSV) you can use the option --translate to map columns to different attributes, e.g. --translate "child=_from" --translate "parent=_to".
https://www.arangodb.com/docs/stable/programs-arangoimport-examples-csv.html#attribute-name-translation
If the references are just keys, then you may use --from-collection-prefix and --to-collection-prefix to prepend the collection name.
--translate is not supported for JSON inputs. You can do the translation and import using a driver, or edit the source file somehow, or import into a collection and then use AQL to adjust the fields.
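A hypothetical invocation combining the options mentioned above (the file, collection, and column names are made up for illustration):
# Import an edge CSV whose "child"/"parent" columns hold plain keys,
# mapping them to _from/_to and prefixing them with the vertex collection name.
arangoimport --file edges.csv --type csv --collection children_of \
  --translate "child=_from" --translate "parent=_to" \
  --from-collection-prefix nodes --to-collection-prefix nodes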

How can I find the value of specific Spark configuration property?

How can I find the value of a spark configuration in my spark code?
For example, I would like to find the value of spark.sql.shuffle.partitions and reference this in my code.
The following code will return all values:
spark.sparkContext.getConf().getAll()
How can I retrieve a single configuration setting?
Like this.
spark.conf.get("spark.sql.shuffle.partitions")
'200' # returns default value here
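If the key might not be set, spark.conf.get also accepts a fallback value, and spark.conf.getOption returns an Option instead of throwing. A minimal sketch, assuming a SparkSession named spark (the key and fallback below are just illustrative):
// Supply a fallback to avoid an exception for keys that have no value set:
spark.conf.get("spark.executor.memory", "1g")
// Or get an Option instead:
spark.conf.getOption("spark.executor.memory")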

Linux configuration files names with numbers

I just wonder why some tools have default configuration files with numbers in their names.
For example: 50-default.conf (for rsyslog).
What's the reason for this number and what does it mean?
These numbers control config file ordering and precedence: included files are read in lexical order, so if the same parameter is configured in both 10-smth.conf and 20-smth.conf, the latter overrides the former.

In Zeppelin, how can I change the value of parameters of form zeppelin.x.y

In particular, how can I change the value of zeppelin.spark.sql.stacktrace?
An error message gives the following comment:
cannot recognize input near 'SELECT' 'X' '.' in expression specification; line 64 pos 10
set zeppelin.spark.sql.stacktrace = true to see full stacktrace
But how, exactly, do I set zeppelin.spark.sql.stacktrace to true? I've tried various config options, such as adding an XML property definition in zeppelin-site.xml and adding extra Java options via zeppelin-env.sh, with no effect.
The answer is the not-very-obvious interpreter.json file, also in Zeppelin's conf directory. Simply find this entry:
"zeppelin.spark.sql.stacktrace": "false"
and change it to
"zeppelin.spark.sql.stacktrace": "true"
Restart Zeppelin and you will get the full stack traces with SQL errors.

Spark: how to get all configuration parameters

I'm trying to find out what configuration parameters my spark app is executing with. Is there a way to get all parameters, including the default ones?
E.g. if you execute "set;" on a Hive console, it'll list full Hive configuration. I'm looking for an analogous action/command for Spark.
UPDATE:
I've tried the solution proposed by karthik manchala and get the results below. As far as I can tell, these are not all the parameters; e.g. spark.shuffle.memoryFraction (and many more) is missing.
scala> println(sc.getConf.getAll.deep.mkString("\n"));
(spark.eventLog.enabled,true)
(spark.dynamicAllocation.minExecutors,1)
(spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS,...)
(spark.repl.class.uri,http://...:54157)
(spark.tachyonStore.folderName,spark-46d43c17-b0b3-4b61-a017-a186075849ca)
(spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES,http://...)
(spark.driver.host,...l)
(spark.yarn.jar,local:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/lib/spark-assembly.jar)
(spark.yarn.historyServer.address,http://...:18088)
(spark.dynamicAllocation.executorIdleTimeout,60)
(spark.serializer,org.apache.spark.serializer.KryoSerializer)
(spark.authenticate,false)
(spark.fileserver.uri,http://...:33681)
(spark.app.name,Spark shell)
(spark.dynamicAllocation.maxExecutors,30)
(spark.dynamicAllocation.initialExecutors,3)
(spark.ui.filters,org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter)
(spark.driver.port,46781)
(spark.shuffle.service.enabled,true)
(spark.master,yarn-client)
(spark.eventLog.dir,hdfs://.../user/spark/applicationHistory)
(spark.app.id,application_1449242356422_80431)
(spark.driver.appUIAddress,http://...:4040)
(spark.driver.extraLibraryPath,/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native)
(spark.dynamicAllocation.schedulerBacklogTimeout,1)
(spark.shuffle.service.port,7337)
(spark.executor.id,<driver>)
(spark.jars,)
(spark.dynamicAllocation.enabled,true)
(spark.executor.extraLibraryPath,/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native)
(spark.yarn.am.extraLibraryPath,/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native)
You can do the following:
sparkContext.getConf().getAll();
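Note that getConf().getAll() only returns properties that were explicitly set for the application; built-in defaults such as spark.shuffle.memoryFraction are not included. In more recent Spark versions (2.x+), Spark SQL's SET -v additionally lists SQL configuration properties together with their current, possibly default, values. A minimal sketch, assuming a SparkSession named spark:
// Only explicitly set application properties:
spark.sparkContext.getConf.getAll.foreach(println)
// SQL configuration properties, including defaults, with documentation:
spark.sql("SET -v").show(1000, false)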
