Spark: how to get all configuration parameters - apache-spark

I'm trying to find out what configuration parameters my Spark app is executing with. Is there a way to get all parameters, including the default ones?
E.g. if you execute "set;" on a Hive console, it'll list the full Hive configuration. I'm looking for an analogous action/command for Spark.
UPDATE:
I've tried the solution proposed by karthik manchala and I'm getting the results below. As far as I can tell, these are not all the parameters; e.g. spark.shuffle.memoryFraction (and many others) is missing.
scala> println(sc.getConf.getAll.deep.mkString("\n"));
(spark.eventLog.enabled,true)
(spark.dynamicAllocation.minExecutors,1)
(spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS,...)
(spark.repl.class.uri,http://...:54157)
(spark.tachyonStore.folderName,spark-46d43c17-b0b3-4b61-a017-a186075849ca)
(spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES,http://...)
(spark.driver.host,...)
(spark.yarn.jar,local:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/lib/spark-assembly.jar)
(spark.yarn.historyServer.address,http://...:18088)
(spark.dynamicAllocation.executorIdleTimeout,60)
(spark.serializer,org.apache.spark.serializer.KryoSerializer)
(spark.authenticate,false)
(spark.fileserver.uri,http://...:33681)
(spark.app.name,Spark shell)
(spark.dynamicAllocation.maxExecutors,30)
(spark.dynamicAllocation.initialExecutors,3)
(spark.ui.filters,org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter)
(spark.driver.port,46781)
(spark.shuffle.service.enabled,true)
(spark.master,yarn-client)
(spark.eventLog.dir,hdfs://.../user/spark/applicationHistory)
(spark.app.id,application_1449242356422_80431)
(spark.driver.appUIAddress,http://...:4040)
(spark.driver.extraLibraryPath,/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native)
(spark.dynamicAllocation.schedulerBacklogTimeout,1)
(spark.shuffle.service.port,7337)
(spark.executor.id,<driver>)
(spark.jars,)
(spark.dynamicAllocation.enabled,true)
(spark.executor.extraLibraryPath,/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native)
(spark.yarn.am.extraLibraryPath,/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native)

You can do the following:
sparkContext.getConf().getAll();
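Note that getAll only returns values that were explicitly set (in spark-defaults.conf, on spark-submit, or in code); parameters left at their defaults are resolved by the components that read them and therefore never appear in the list. A minimal spark-shell sketch, assuming you know the documented default of the key you care about (0.2 for spark.shuffle.memoryFraction in Spark 1.x):
// List everything that was explicitly set, sorted for readability
sc.getConf.getAll.sortBy(_._1).foreach { case (k, v) => println(s"$k=$v") }
// For an unset key, supply the documented default yourself
println(sc.getConf.get("spark.shuffle.memoryFraction", "0.2"))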

Related

Grafana query parsing error with label_values

There are two CouchDB servers and I am using a variable in Grafana to visualize some metrics. The issue is that the query with the variable ends up being parsed wrongly:
couchdb_server_node_info{instance="10\\.10\\.10\\.199:9984"}
I don't know why it includes the backslashes, which leads to an empty result. Am I using the label_values query correctly?
Here is my variable setting, whose preview shows the two servers, and here is how I use it in the panel query.
It is fixed! I had to disable the Include All option.

How can I find the value of specific Spark configuration property?

How can I find the value of a spark configuration in my spark code?
For example, I would like to find the value of spark.sql.shuffle.partitions and reference this in my code.
The following code will return all values:
spark.sparkContext.getConf().getAll()
How can I retrieve a single configuration setting?
Like this.
spark.conf.get("spark.sql.shuffle.partitions")
'200' # returns default value here
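As a side note, spark.conf.get also accepts a fallback value, which avoids a NoSuchElementException for keys that were never set explicitly. A small sketch of how it might be used (Spark 2+ SparkSession API; the fallback "200" is just the documented default):
// Read a single setting with a fallback and use it like any other value
val shufflePartitions = spark.conf.get("spark.sql.shuffle.partitions", "200").toInt
println(s"Shuffle partitions: $shufflePartitions")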

Spark 1.6 - What is the difference between df.write.save() and df.write.parquet() [duplicate]

I'm using Spark-Java.
I need to know if there is any difference (performance etc.) between the following write-to-Hadoop methods:
ds.write().mode(mode).format("orc").save(path);
Or
ds.write().mode(mode).orc(path);
Thanks.
There is no difference.
orc(path) is simply a shortcut method for format("orc").save(path).
The same applies to .json(path) and .csv(path); if you call save(path) without specifying a format, the default data source is Parquet.
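A quick sketch of the equivalence (shown in Scala for brevity; the DataFrameWriter calls are the same from Java, and the SparkSession setup is only there to make the snippet self-contained and assumes a build where the ORC data source is available):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("write-demo").getOrCreate()
val ds = spark.range(10).toDF("id")

// These two writes are equivalent: orc(path) delegates to format("orc").save(path)
ds.write.mode("overwrite").format("orc").save("/tmp/out-a")
ds.write.mode("overwrite").orc("/tmp/out-b")

// save() without an explicit format uses the default data source, which is
// Parquet unless spark.sql.sources.default says otherwise
ds.write.mode("overwrite").save("/tmp/out-parquet")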

Can spark-defaults.conf resolve environment variables?

If I have a line like below in my spark-env.sh file
export MY_JARS=$(jars=(/my/lib/dir/*.jar); IFS=,; echo "${jars[*]}")
which gives me a comma delimited list of jars in /my/lib/dir, is there a way I can specify
spark.jars $MY_JARS
in the spark-defaults.conf?
tl;dr No, it cannot, but there is a solution.
Spark reads the conf file as a properties file without any additional env var substitution.
What you could do, however, is write the computed value of MY_JARS from spark-env.sh straight into spark-defaults.conf using >> (append). For duplicate keys the last entry wins, so it does not matter that the file may accumulate similar entries.
I tried with Spark 1.4 and it did not work.
spark-defaults.conf is a key/value file, and looking at the code it seems the values are not evaluated.
At least in Spark 3+, there is a way to do this: ${env:VAR_NAME}.
For instance if you want to add the current username to the Spark Metrics Namespace, add this to your spark-defaults.conf file:
spark.metrics.namespace=${env:USER}
The generated metrics will show the username instead of the default namespace:
testuser.driver.BlockManager.disk.diskSpaceUsed_MB.csv
testuser.driver.BlockManager.memory.maxMem_MB.csv
testuser.driver.BlockManager.memory.maxOffHeapMem_MB.csv
testuser.driver.BlockManager.memory.maxOnHeapMem_MB.csv
... etc ...
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/internal/VariableSubstitution.html
A helper class that enables substitution using syntax like ${var}, ${system:var} and ${env:var}.
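As a quick sanity check that your build understands this syntax, the same ${env:var} substitution can be exercised in SQL text from the spark-shell (this relies on spark.sql.variable.substitute, which defaults to true; the output below is illustrative):
scala> spark.sql("SELECT '${env:USER}' AS username").show()
+--------+
|username|
+--------+
|testuser|
+--------+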

Using indexed types for ElasticSearch in Titan

I currently have a VM running Titan over a local Cassandra backend and would like the ability to use ElasticSearch to index strings using CONTAINS matches and regular expressions. Here's what I have so far:
After titan.sh is run, a Groovy script is used to load in the data from separate vertex and edge files. The first stage of this script loads the graph from Titan and sets up the ES properties:
config.setProperty("storage.backend","cassandra")
config.setProperty("storage.hostname","127.0.0.1")
config.setProperty("storage.index.elastic.backend","elasticsearch")
config.setProperty("storage.index.elastic.directory","db/es")
config.setProperty("storage.index.elastic.client-only","false")
config.setProperty("storage.index.elastic.local-mode","true")
The second part of the script sets up the indexed types:
g.makeKey("property").dataType(String.class).indexed("elastic",Edge.class).make();
The third part loads the data from the CSV files; this has been tested and works fine.
My problem is, I don't seem to be able to use the ElasticSearch functions when I do a Gremlin query. For example:
g.E.has("property",CONTAINS,"test")
returns 0 results, even though I know this field contains the string "test" for that property at least once. Weirder still, when I change CONTAINS to something that isn't recognised by ElasticSearch I get a "no such property" error. I can also perform exact string matches and numerical comparisons such as greater than or less than; however, I suspect the default indexing method is being used instead of ElasticSearch in these cases.
Given the lack of errors when I try to run a more advanced ES query, I am at a loss as to what is causing the problem here. Is there anything I may have missed?
Thanks,
Adam
I'm not quite sure what's going wrong in your code. From your description everything looks fine. Can you try the following script (just paste it into your Gremlin REPL):
config = new BaseConfiguration()
config.setProperty("storage.backend","inmemory")
config.setProperty("storage.index.elastic.backend","elasticsearch")
config.setProperty("storage.index.elastic.directory","/tmp/es-so")
config.setProperty("storage.index.elastic.client-only","false")
config.setProperty("storage.index.elastic.local-mode","true")
g = TitanFactory.open(config)
g.makeKey("name").dataType(String.class).make()
g.makeKey("property").dataType(String.class).indexed("elastic",Edge.class).make()
g.makeLabel("knows").make()
g.commit()
alice = g.addVertex(["name":"alice"])
bob = g.addVertex(["name":"bob"])
alice.addEdge("knows", bob, ["property":"foo test bar"])
g.commit()
// test queries
g.E.has("property",CONTAINS,"test")
g.query().has("property",CONTAINS,"test").edges()
The last 2 lines should return something like e[1t-4-1w][4-knows-8]. If that works and you still can't figure out what's wrong in your code, it would be good if you can share your full code (e.g. in Github or in a Gist).
Cheers,
Daniel
