Which PySpark session properties are ignored if set after session creation? - apache-spark

It's possible to set PySpark session properties with the command below:
spark = (SparkSession
.builder
.appName("my-app")
.config("property.key.here", configValueHere)
.enableHiveSupport()
.getOrCreate())
But some properties must be set during session creation. For example, many driver memory options won't take effect if set after session creation (even though their new values are returned by spark.sparkContext.getConf().getAll()).
Which are the properties that must be set during session creation?
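To make the symptom concrete, here is a minimal PySpark sketch (the "8g" value is made up): spark.driver.memory is read when the driver JVM is launched, so a value set against an already-running session is reported by getAll() without changing the actual driver heap.
from pyspark.sql import SparkSession

# An existing session; the driver JVM is already running at this point.
spark = SparkSession.builder.appName("my-app").getOrCreate()

# getOrCreate() returns the same session and merely records the new value...
spark2 = SparkSession.builder.config("spark.driver.memory", "8g").getOrCreate()

# ...so the conf reports "8g" even though the driver heap has not changed.
print(dict(spark2.sparkContext.getConf().getAll()).get("spark.driver.memory"))

# JVM start-up options such as driver memory have to be supplied before the
# driver starts, e.g. via spark-submit/pyspark --driver-memory or spark-defaults.conf.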

Related

Unable to set configuration variable in SparkConf

Background:
I am currently using Spark lineage information about all the operations happening. I have a transformation with more than 35 fields and I need to log all of them. However, by default Spark logs only 25 fields, as per the Spark code. This can be overridden by setting
spark.debug.maxToStringFields
So here is how I do the same
Code
val sparkConf = new SparkConf().set("spark.debug.maxToStringFields", "100")
  .setMaster("local[*]").setAppName("My App")
val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
However, the property does not seem to be set in the Spark session.
DEBUG
val allConfs = sparkSession.sparkContext.getConf
allConfs.getAll.foreach(conf => println(conf._1 + " value " + conf._2))
Here I am unable to see the property that I have set. I also still get the error/message that Spark prints when the default length of 25 is exceeded.
What am I missing here?
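For comparison, and not as a confirmed fix for the issue above, here is a minimal sketch of how the same property is usually passed in PySpark before the first context is created, and then read back:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("local[*]")
    .appName("My App")
    .config("spark.debug.maxToStringFields", "100")
    .getOrCreate())

# Verify that the value actually reached the context's configuration.
print(spark.sparkContext.getConf().get("spark.debug.maxToStringFields"))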

spark setCassandraConf is not working as expected

I am using .setCassandraConf(c_options_conf) to set up the SparkSession to connect to the Cassandra cluster, as shown below.
Working fine:
val spark = SparkSession
.builder()
.appName("DatabaseMigrationUtility")
.config("spark.master",devProps.getString("deploymentMaster"))
.getOrCreate()
.setCassandraConf(c_options_conf)
If I save a table using the DataFrame writer object as below, it points to the configured cluster and saves to Cassandra perfectly fine:
writeDfToCassandra(o_vals_df, key_space, "model_vals"); // working fine using o_vals_df
But if I do it as below, it points to localhost instead of the Cassandra cluster and fails to save.
Not working:
import spark.implicits._
val sc = spark.sparkContext
val audit_df = sc.parallelize(Seq(LogCaseClass(columnFamilyName, status,
error_msg,currentDate,currentTimeStamp, updated_user))).saveToCassandra(keyspace, columnFamilyName);
It throws an error because it tries to connect to localhost.
Error:
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All
host(s) tried for query failed (tried: localhost/127.0.0.1:9042
(com.datastax.driver.core.exceptions.TransportException:
[localhost/127.0.0.1:9042] Cannot connect))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:233)
What is wrong here? Why is it pointing to the default localhost even though the SparkSession is set to the Cassandra cluster and the earlier method works fine?
We need to set the config using two set methods of SparkSession, i.e. .config(conf) and .setCassandraConf(c_options_conf), with the same values, like below:
val spark = SparkSession
.builder()
.appName("DatabaseMigrationUtility")
.config("spark.master",devProps.getString("deploymentMaster"))
.config("spark.dynamicAllocation.enabled",devProps.getString("spark.dynamicAllocation.enabled"))
.config("spark.executor.memory",devProps.getString("spark.executor.memory"))
.config("spark.executor.cores",devProps.getString("spark.executor.cores"))
.config("spark.executor.instances",devProps.getString("spark.executor.instances"))
.config(conf)
.getOrCreate()
.setCassandraConf(c_options_conf)
Then it will work for the latest Cassandra API as well as the RDD/DataFrame APIs.
Setting the IP via the spark.cassandra.connection.host Spark property (not via setCassandraConf!) works for both RDDs and DataFrames. This property can be set from the command line when submitting the job, or explicitly (example from the documentation):
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "192.168.123.10")
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
val sc = new SparkContext("spark://192.168.123.10:7077", "test", conf)
Take a look at the documentation for the connector, including the reference of existing configuration properties.
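The same idea in PySpark, as a sketch: the host is a placeholder, the keyspace and table names are taken from the question, and the spark-cassandra-connector package is assumed to be on the classpath.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("DatabaseMigrationUtility")
    .config("spark.cassandra.connection.host", "192.168.123.10")  # placeholder IP
    .config("spark.cassandra.auth.username", "cassandra")
    .config("spark.cassandra.auth.password", "cassandra")
    .getOrCreate())

# Stand-in DataFrame; both the DataFrame writer and the RDD API now use the configured host.
df = spark.createDataFrame([(1, "ok")], ["id", "status"])
(df.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="key_space", table="model_vals")
    .mode("append")
    .save())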

What is the purpose of global temporary views?

Trying to understand how to use the Spark Global Temporary Views.
In one spark-shell session I've created a view
spark = SparkSession.builder.appName('spark_sql').getOrCreate()
df = (
spark.read.option("header", "true")
.option("delimiter", ",")
.option("inferSchema", "true")
.csv("/user/root/data/cars.csv"))
df.createGlobalTempView("my_cars")
# works without any problem
spark.sql("SELECT * FROM global_temp.my_cars").show()
And on another I tried to access it, without success (table or view not found).
#second Spark Shell
spark = SparkSession.builder.appName('spark_sql').getOrCreate()
spark.sql("SELECT * FROM global_temp.my_cars").show()
That's the error I receive:
pyspark.sql.utils.AnalysisException: u"Table or view not found: `global_temp`.`my_cars`; line 1 pos 14;\n'Project [*]\n+- 'UnresolvedRelation `global_temp`.`my_cars`\n"
I've read that each spark-shell has its own context, and that's why one spark-shell cannot see the other. So I don't understand: what is the use of a global temporary view, and where is it useful?
Thanks
In the Spark documentation you can see:
If you want to have a temporary view that is shared among all sessions
and keep alive until the Spark application terminates, you can create
a global temporary view.
The global table remains accessible as long as the application is alive.
Opening a new shell and giving it the same application name will just create a new application.
You can try and test it within the same shell:
spark.newSession().sql("SELECT * FROM global_temp.my_cars").show()
Please see my answer on a similar question for a more detailed example, as well as a short definition of a Spark application and a Spark session.
Temporary views in Spark SQL are session-scoped and will disappear if the session that created them terminates. If you want a temporary view that is shared among all sessions and kept alive until the Spark application terminates, you can create a global temporary view. A global temporary view is tied to the system-preserved database global_temp, and you must use the qualified name to refer to it:
df.createGlobalTempView("people")
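To make the scope difference concrete, here is a small PySpark sketch run inside a single application (view names are made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gtv-demo").getOrCreate()
df = spark.range(5)

df.createOrReplaceTempView("local_view")   # session-scoped
df.createGlobalTempView("shared_view")     # application-scoped, lives in the global_temp database

other = spark.newSession()                 # a second session in the same application

other.sql("SELECT * FROM global_temp.shared_view").show()   # works
# other.sql("SELECT * FROM local_view").show()              # fails: table or view not found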

How do I get independent service Zeppelin to see Hive?

I am using HDP-2.6.0.3 but I need Zeppelin 0.8, so I have installed it as an independent service. When I run:
%sql
show tables
I get nothing back and I get 'table not found' when I run Spark2 SQL commands. Tables can be seen in the 0.7 Zeppelin that is part of HDP.
Can anyone tell me what I am missing, for Zeppelin/Spark to see Hive?
The steps I performed to build the Zeppelin 0.8 instance are as follows:
mvn clean package -DskipTests -Pspark-2.1 -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Ppyspark -Psparkr -Pr -Pscala-2.11
Copied zeppelin-site.xml and shiro.ini from /usr/hdp/2.6.0.3-8/zeppelin/conf to /home/ed/zeppelin/conf.
Created /home/ed/zeppelin/conf/zeppelin-env.sh, in which I put the following:
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112
export HADOOP_CONF_DIR=/etc/hadoop/conf
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.6.0.3-8"
Copied /etc/hive/conf/hive-site.xml to /home/ed/zeppelin/conf
EDIT:
I have also tried:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("interfacing spark sql to hive metastore without configuration file")
.config("hive.metastore.uris", "thrift://s2.royble.co.uk:9083") // replace with your hivemetastore service's thrift url
.config("url", "jdbc:hive2://s2.royble.co.uk:10000/default")
.config("UID", "admin")
.config("PWD", "admin")
.config("driver", "org.apache.hive.jdbc.HiveDriver")
.enableHiveSupport() // don't forget to enable hive support
.getOrCreate()
same result, and:
import java.sql.{DriverManager, Connection, Statement, ResultSet}
val url = "jdbc:hive2://"
val driver = "org.apache.hive.jdbc.HiveDriver"
val user = "admin"
val password = "admin"
Class.forName(driver).newInstance
val conn: Connection = DriverManager.getConnection(url, user, password)
which gives:
java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
ERROR XSDB6: Another instance of Derby may have already booted the database /home/ed/metastore_db
Fixed error with:
val url = "jdbc:hive2://s2.royble.co.uk:10000"
but still no tables :(
This works:
import java.sql.{DriverManager, Connection, Statement, ResultSet}
val url = "jdbc:hive2://s2.royble.co.uk:10000"
val driver = "org.apache.hive.jdbc.HiveDriver"
val user = "admin"
val password = "admin"
Class.forName(driver).newInstance
val conn: Connection = DriverManager.getConnection(url, user, password)
val r: ResultSet = conn.createStatement.executeQuery("SELECT * FROM tweetsorc0")
but then I have the pain of converting the ResultSet to a DataFrame. I'd rather SparkSession worked and gave me a DataFrame, so I will add a bounty later today.
I had a similar problem in Cloudera Hadoop. In my case the problem was that Spark SQL did not see my Hive metastore, so when I used my SparkSession object for Spark SQL I could not see my previously created tables. I managed to solve it by adding the following to zeppelin-env.sh:
export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
export HADOOP_HOME=/opt/cloudera/parcels/CDH
export SPARK_CONF_DIR=/etc/spark/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
(I assume for Hortonworks these paths are something else). I also changed spark.master from local[*] to yarn-client in the Interpreter UI. Most importantly, I manually copied hive-site.xml into /etc/spark/conf/ because I thought it was strange that it was not in that directory, and that solved my problem.
So my advice is to check whether hive-site.xml exists in your SPARK_CONF_DIR and, if not, add it manually. I also found a guide for Hortonworks and Zeppelin in case this does not work.
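As a quick sanity check once hive-site.xml is in place, here is a PySpark sketch; the thrift URI is the one from the question, so replace it with your own metastore:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("hive-metastore-check")
    .config("hive.metastore.uris", "thrift://s2.royble.co.uk:9083")  # from the question
    .enableHiveSupport()
    .getOrCreate())

spark.sql("SHOW TABLES").show()   # should list the Hive tables if the metastore is visible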

Is it possible to get the current spark context settings in PySpark?

I'm trying to get the path to spark.worker.dir for the current SparkContext.
If I explicitly set it as a config param, I can read it back out of SparkConf, but is there any way to access the complete config (including all defaults) using PySpark?
Spark 2.1+
spark.sparkContext.getConf().getAll() where spark is your SparkSession (this returns a list of (key, value) pairs with all configured settings)
Yes: sc.getConf().getAll()
which uses the method SparkConf.getAll(), as accessed through SparkContext.getConf().
See it in action:
In [4]: sc.getConf().getAll()
Out[4]:
[(u'spark.master', u'local'),
(u'spark.rdd.compress', u'True'),
(u'spark.serializer.objectStreamReset', u'100'),
(u'spark.app.name', u'PySparkShell')]
Update configuration in Spark 2.3.1
To change the default spark configurations you can follow these steps:
Import the required classes
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
Get the default configurations
spark.sparkContext._conf.getAll()
Update the default configurations
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.app.name', 'Spark Updated Conf'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory','4g')])
Stop the current Spark Session
spark.sparkContext.stop()
Create a Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
Spark 1.6+
sc.getConf.getAll.foreach(println)
For a complete overview of your Spark environment and configuration I found the following code snippets useful:
SparkContext:
for item in sorted(sc._conf.getAll()): print(item)
Hadoop Configuration:
hadoopConf = {}
iterator = sc._jsc.hadoopConfiguration().iterator()
while iterator.hasNext():
    prop = iterator.next()
    hadoopConf[prop.getKey()] = prop.getValue()
for item in sorted(hadoopConf.items()): print(item)
Environment variables:
import os
for item in sorted(os.environ.items()): print(item)
Simply running
sc.getConf().getAll()
should give you a list with all settings.
Unfortunately, no, the Spark platform as of version 2.3.1 does not provide any way to programmatically access the value of every property at run time. It provides several methods to access the values of properties that were explicitly set through a configuration file (like spark-defaults.conf), set through the SparkConf object when you created the session, or set through the command line when you submitted the job, but none of these methods will show the default value for a property that was not explicitly set. For completeness, the best options are:
The Spark application’s web UI, usually at http://<driver>:4040, has an “Environment” tab with a property value table.
The SparkContext keeps a hidden reference to its configuration in PySpark, and the configuration provides a getAll method: spark.sparkContext._conf.getAll().
Spark SQL provides the SET command that will return a table of property values: spark.sql("SET").toPandas(). You can also use SET -v to include a column with the property’s description.
(These three methods all return the same data on my cluster.)
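For reference, the two programmatic options above as a small PySpark sketch (note that _conf is an internal attribute and may change between versions, and toPandas() needs pandas installed):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hidden SparkConf reference on the SparkContext (internal API)
for key, value in sorted(spark.sparkContext._conf.getAll()):
    print(key, "=", value)

# Spark SQL SET command; "SET -v" adds a description column
print(spark.sql("SET -v").toPandas().head())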
For Spark 2+ you can also use the following when using Scala:
spark.conf.getAll // spark is your SparkSession
You can use:
sc.sparkContext.getConf.getAll
For example, I often have the following at the top of my Spark programs:
logger.info(sc.sparkContext.getConf.getAll.mkString("\n"))
Just for the record, the analogous Java version:
Tuple2<String, String>[] sc = sparkConf.getAll();
for (int i = 0; i < sc.length; i++) {
    System.out.println(sc[i]);
}
Suppose I want to increase the driver memory at runtime using SparkSession:
s2 = SparkSession.builder.config("spark.driver.memory", "29g").getOrCreate()
Now I want to view the updated settings:
s2.conf.get("spark.driver.memory")
To get all the settings, you can make use of spark.sparkContext._conf.getAll()
Hope this helps
Not sure if you can get all the default settings easily, but specifically for the worker dir, it's quite straightforward:
from pyspark import SparkFiles
print(SparkFiles.getRootDirectory())
If you want to see the configuration in Databricks, use the command below:
spark.sparkContext._conf.getAll()
I would suggest you try the method below in order to get the current Spark context settings: SparkConf.getAll(), as accessed by spark.sparkContext._conf.
Get the default configurations specifically for Spark 2.1+
spark.sparkContext.getConf().getAll()
Stop the current Spark Session
spark.sparkContext.stop()
Create a Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
