How can I find the value of a spark configuration in my spark code?
For example, I would like to find the value of spark.sql.shuffle.partitions and reference this in my code.
The following code returns all values:
spark.sparkContext.getConf().getAll()
How can I retrieve a single configuration setting?
Like this:
spark.conf.get("spark.sql.shuffle.partitions")
'200' # returns default value here
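For reference, here is the same lookup with an explicit fallback, sketched in the Scala spark-shell (the equivalent RuntimeConfig calls exist in PySpark); the fallback value is only an illustration:
// `spark` is the SparkSession provided by the shell.
// Returns the configured value if present, otherwise the supplied fallback.
val partitions = spark.conf.get("spark.sql.shuffle.partitions", "200")
// The low-level SparkConf only contains entries that were explicitly set:
val maybeSet = spark.sparkContext.getConf.getOption("spark.sql.shuffle.partitions")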
I have a flow file containing key-value pairs that will be read from Kafka and sent to NiFi. Using GetFile I will receive the file, and then, by configuring a processor with matching regexes, we can extract the contents of the flowfile and keep them as flowfile attributes.
After that I need to take the specific attribute(s) obtained in the step above and compare their value(s) with a particular string, or with the corresponding property value read from the nifi.properties file or a custom properties file.
Based on that check I need to execute a script using ExecuteStreamCommand: if the extracted attribute value matches the nifi.properties or custom properties value, the script should run.
The question here is how to compare the property values and execute the script.
I am using Java 8 and Cassandra in my application.
The datatype of current_date in the Cassandra table is date.
I am using entities to map to the table values, and the datatype in the entity for the same field is com.datastax.driver.core.LocalDate.
When I try to retrieve a record with
select * from table where current_date='2017-06-06';
I get the following error:
Caused by: com.datastax.driver.core.exceptions.CodecNotFoundException: Codec
not found for requested operation:
['org.apache.cassandra.db.marshal.SimpleDateType' <->
com.datastax.driver.core.LocalDate]
I faced a similar error message while querying cassandra from Presto.
I needed to set cassandra.protocol-version=V4 in cassandra.properties in Presto to resolve the problem in my case.
If you get this problem in a Java SDK application, check whether changing the protocol version resolves it. In some cases, you have to write your own codec implementation.
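With the DataStax Java driver that would mean pinning the protocol version when building the Cluster. A minimal sketch, written in Scala for consistency with the Spark examples here (the same builder calls exist in Java), with a placeholder contact point:
import com.datastax.driver.core.{Cluster, ProtocolVersion}
// Cassandra's date type requires native protocol v4 or later,
// so force the version instead of letting the driver negotiate an older one.
val cluster = Cluster.builder()
  .addContactPoint("127.0.0.1") // placeholder contact point
  .withProtocolVersion(ProtocolVersion.V4)
  .build()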
By default, the Java driver maps the Cassandra date type to the com.datastax.driver.core.LocalDate Java type.
If you need to map date to java.time.LocalDate instead, you need to add the driver extras module (the cassandra-driver-extras artifact) to your project.
You can also specify the codec for a given column only:
@Column(codec = LocalDateCodec.class)
java.time.LocalDate current_date;
If these two do not work, have a look at the code where you create the session and cluster to connect to the database. Since date is a newer addition to Cassandra's data types, the protocol version can also have an impact; update the version accordingly.
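A minimal sketch of registering the JDK 8 codec from the extras module, again written in Scala with a placeholder contact point (the same calls work from Java 8):
import com.datastax.driver.core.{Cluster, CodecRegistry}
import com.datastax.driver.extras.codecs.jdk8.LocalDateCodec
// Register the extras codec so the driver maps Cassandra's date type
// to java.time.LocalDate instead of com.datastax.driver.core.LocalDate.
val registry = new CodecRegistry().register(LocalDateCodec.instance)
val cluster = Cluster.builder()
  .addContactPoint("127.0.0.1") // placeholder contact point
  .withCodecRegistry(registry)
  .build()
val session = cluster.connect()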
In the Apache Hive CLI or Beeline CLI, I need to concatenate the value of a variable with a string. Is it possible to do so?
Example:
set path_on_hdfs="/apps/hive/warehouse/my_db.db";
How can I get something like '${hivevar:path_on_hdfs}/myTableName'?
You can try something like this:
set path_on_hdfs=/test1/test2;
create external table test(id int)
location '${hiveconf:path_on_hdfs}/myTable';
It should work...
I'm trying to find out what configuration parameters my spark app is executing with. Is there a way to get all parameters, including the default ones?
E.g. if you execute "set;" on a Hive console, it'll list full Hive configuration. I'm looking for an analogous action/command for Spark.
UPDATE:
I've tried the solution proposed by karthik manchala and I'm getting the results below. As far as I can tell, these are not all the parameters; e.g. spark.shuffle.memoryFraction (and a lot more) is missing.
scala> println(sc.getConf.getAll.deep.mkString("\n"));
(spark.eventLog.enabled,true)
(spark.dynamicAllocation.minExecutors,1)
(spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS,...)
(spark.repl.class.uri,http://...:54157)
(spark.tachyonStore.folderName,spark-46d43c17-b0b3-4b61-a017-a186075849ca)
(spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES,http://...)
(spark.driver.host,...l)
(spark.yarn.jar,local:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/lib/spark-assembly.jar)
(spark.yarn.historyServer.address,http://...:18088)
(spark.dynamicAllocation.executorIdleTimeout,60)
(spark.serializer,org.apache.spark.serializer.KryoSerializer)
(spark.authenticate,false)
(spark.fileserver.uri,http://...:33681)
(spark.app.name,Spark shell)
(spark.dynamicAllocation.maxExecutors,30)
(spark.dynamicAllocation.initialExecutors,3)
(spark.ui.filters,org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter)
(spark.driver.port,46781)
(spark.shuffle.service.enabled,true)
(spark.master,yarn-client)
(spark.eventLog.dir,hdfs://.../user/spark/applicationHistory)
(spark.app.id,application_1449242356422_80431)
(spark.driver.appUIAddress,http://...:4040)
(spark.driver.extraLibraryPath,/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native)
(spark.dynamicAllocation.schedulerBacklogTimeout,1)
(spark.shuffle.service.port,7337)
(spark.executor.id,<driver>)
(spark.jars,)
(spark.dynamicAllocation.enabled,true)
(spark.executor.extraLibraryPath,/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native)
(spark.yarn.am.extraLibraryPath,/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native)
You can do the following:
sparkContext.getConf().getAll();
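Note that getAll only returns entries that were explicitly set on the application. In more recent Spark versions you can also dump the Spark SQL configuration together with its defaults; a sketch, assuming Spark 2.x+ and the Scala shell:
// Entries explicitly set on this application:
spark.sparkContext.getConf.getAll.foreach(println)
// All Spark SQL configuration keys, including defaults and their documentation:
spark.sql("SET -v").show(1000, false)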
If I have a line like below in my spark-env.sh file
export MY_JARS=$(jars=(/my/lib/dir/*.jar); IFS=,; echo "${jars[*]}")
which gives me a comma delimited list of jars in /my/lib/dir, is there a way I can specify
spark.jars $MY_JARS
in the spark-defaults.conf?
tl;dr No, it cannot, but there is a solution.
Spark reads the conf file as a properties file without any additional env var substitution.
What you could do, however, is write the computed value of MY_JARS from spark-env.sh straight to spark-defaults.conf using >> (append). The last entry wins, so don't worry that there could be many similar entries.
I tried this with Spark 1.4 and it did not work.
spark-defaults.conf is a key/value file and, looking at the code, it seems the values are not evaluated.
At least in Spark 3+, there is a way to do this: ${env:VAR_NAME}.
For instance if you want to add the current username to the Spark Metrics Namespace, add this to your spark-defaults.conf file:
spark.metrics.namespace=${env:USER}
The generated metrics will show the username instead of the default namespace:
testuser.driver.BlockManager.disk.diskSpaceUsed_MB.csv
testuser.driver.BlockManager.memory.maxMem_MB.csv
testuser.driver.BlockManager.memory.maxOffHeapMem_MB.csv
testuser.driver.BlockManager.memory.maxOnHeapMem_MB.csv
... etc ...
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/internal/VariableSubstitution.html
A helper class that enables substitution using syntax like ${var}, ${system:var} and ${env:var}.
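As a quick illustration of that substitution (a sketch, assuming the default spark.sql.variable.substitute=true and the Scala shell), environment variables and system properties can be referenced directly in the SQL text:
// ${env:USER} is replaced with the USER environment variable before parsing;
// ${system:user.name} would read the JVM system property instead.
spark.sql("SELECT '${env:USER}' AS current_user").show()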