What is the purpose of global temporary views? - apache-spark

I'm trying to understand how to use Spark global temporary views.
In one spark-shell session I created a view:
spark = SparkSession.builder.appName('spark_sql').getOrCreate()
df = (
spark.read.option("header", "true")
.option("delimiter", ",")
.option("inferSchema", "true")
.csv("/user/root/data/cars.csv"))
df.createGlobalTempView("my_cars")
# works without any problem
spark.sql("SELECT * FROM global_temp.my_cars").show()
In a second shell I tried to access it, without success (table or view not found):
#second Spark Shell
spark = SparkSession.builder.appName('spark_sql').getOrCreate()
spark.sql("SELECT * FROM global_temp.my_cars").show()
This is the error I receive:
pyspark.sql.utils.AnalysisException: u"Table or view not found: `global_temp`.`my_cars`; line 1 pos 14;\n'Project [*]\n+- 'UnresolvedRelation `global_temp`.`my_cars`\n"
I've read that each spark-shell has its own context, which is why one spark-shell cannot see the other's views. So what is the purpose of a global temporary view, and where is it useful?
Thanks

The Spark documentation says:
If you want to have a temporary view that is shared among all sessions
and keep alive until the Spark application terminates, you can create
a global temporary view.
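The global temporary view remains accessible as long as the application is alive.
Opening a new shell and giving it the same application name just creates a new application, which cannot see the view.
You can test it within the same shell by querying the view from a new session (Scala syntax; in PySpark, newSession is a method call: spark.newSession()):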
spark.newSession.sql("SELECT * FROM global_temp.my_cars").show()
Please see my answer on a similar question for a more detailed example, as well as a short definition of a Spark application and a Spark session.
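To make the scoping concrete, here is a minimal PySpark sketch. It assumes an already running SparkSession is passed in as `spark`; the helper names `qualified_view_name` and `count_from_new_session` are hypothetical, while `newSession()` and the `global_temp` database (the default value of spark.sql.globalTempDatabase) come from Spark itself.

```python
GLOBAL_TEMP_DB = "global_temp"  # Spark's default database for global temporary views


def qualified_view_name(view_name):
    """Return the database-qualified name needed to query a global temp view."""
    return f"{GLOBAL_TEMP_DB}.{view_name}"


def count_from_new_session(spark, view_name):
    """Count the view's rows from a *new* session of the same application.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    newSession() shares the SparkContext (and therefore the application)
    but has its own session-scoped temporary views.
    """
    other_session = spark.newSession()
    query = f"SELECT * FROM {qualified_view_name(view_name)}"
    return other_session.sql(query).count()


# Usage inside a single pyspark shell (not executed here):
#   df.createGlobalTempView("my_cars")
#   count_from_new_session(spark, "my_cars")  # works: same application
# A second pyspark shell is a different application, so the same call
# there fails with "Table or view not found".
```

A plain createTempView would not be visible to `other_session`, which is the practical difference between the two kinds of view.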

Temporary views in Spark SQL are session-scoped and disappear when the session that created them terminates. If you want a temporary view that is shared among all sessions and kept alive until the Spark application terminates, you can create a global temporary view. A global temporary view is tied to the system-preserved database global_temp, and you must use the qualified name to refer to it:
df.createGlobalTempView("people")
spark.sql("SELECT * FROM global_temp.people").show()

Related

Where are the spark intermediate files stored on the disk?

During a shuffle, the mappers dump their outputs to the local disk from where it gets picked up by the reducers. Where exactly on the disk are those files dumped? I am running pyspark cluster on YARN.
What I have tried so far:
I think the possible locations of the intermediate files are (in decreasing order of likelihood):
1. hadoop/spark/tmp. As per the documentation, this comes from the LOCAL_DIRS environment variable that YARN defines. However, after starting the cluster (I am passing --master yarn) I couldn't find any LOCAL_DIRS environment variable using os.environ, but I can see SPARK_LOCAL_DIRS, which according to the documentation should only be used on Mesos or standalone (any idea why that might be the case?). Anyhow, my SPARK_LOCAL_DIRS is hadoop/spark/tmp.
2. tmp. The default value of spark.local.dir.
3. /home/username. I tried passing a custom value for spark.local.dir when starting pyspark, using --conf spark.local.dir=/home/username.
4. hadoop/yarn/nm-local-dir. This is the value of the yarn.nodemanager.local-dirs property in yarn-site.xml.
I am running the following code and checking whether any intermediate files are created at the above 4 locations by navigating to each location on a worker node.
The code I am running:
from pyspark import storagelevel
df_sales = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
df_products = spark.read.load("gs://monsoon-credittech.appspot.com/spark_datasets/products_parquet")
df_merged = df_sales.join(df_products,df_sales.product_id==df_products.product_id,'inner')
df_merged.persist(storagelevel.StorageLevel.DISK_ONLY)
df_merged.count()
No files are created at any of the 4 locations listed above.
As suggested in one of the answers, I tried to get the directory info in the terminal the following way:
At the end of the log4j.properties file located at $SPARK_HOME/conf/, add log4j.logger.or.apache.spark.api.python.PythonGatewayServer=INFO
This did not help, even with logging set to INFO.
Where are the spark intermediate files (output of mappers, persist etc) stored?
Without getting into the weeds of Spark source, perhaps you can quickly check it live. Something like this:
>>> irdd = spark.sparkContext.range(0,100,1,10)
>>> def wherearemydirs(p):
... import os
... return os.getenv('LOCAL_DIRS')
...
>>>
>>> irdd.map(wherearemydirs).collect()
>>>
...will show local dirs in terminal
/data/1/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,/data/10/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,/data/11/yarn/nm/usercache//appcache/<application_xxxxxxxxxxx_xxxxxxx>,...
But yes, it will basically point to the parent directory (created by YARN) of the UUID-randomized subdirectories created by DiskBlockManager, as @KoedIt mentioned:
:
23/01/05 10:15:37 INFO storage.DiskBlockManager: Created local directory at /data/1/yarn/nm/usercache/<your-user-id>/appcache/application_xxxxxxxxx_xxxxxxx/blockmgr-d4df4512-d18b-4dcf-8197-4dfe781b526a
:
This is going to depend on your cluster setup and your Spark version, but you're more or less looking in the correct places.
For this explanation, I'll be talking about Spark v3.3.1, which is the latest version at the time of this post.
There is an interesting method in org.apache.spark.util.Utils called getConfiguredLocalDirs, and it looks like this:
/**
* Return the configured local directories where Spark can write files. This
* method does not create any directories on its own, it only encapsulates the
* logic of locating the local directories according to deployment mode.
*/
def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
val shuffleServiceEnabled = conf.get(config.SHUFFLE_SERVICE_ENABLED)
if (isRunningInYarnContainer(conf)) {
// If we are in yarn mode, systems can have different disk layouts so we must set it
// to what Yarn on this system said was available. Note this assumes that Yarn has
// created the directories already, and that they are secured so that only the
// user has access to them.
randomizeInPlace(getYarnLocalDirs(conf).split(","))
} else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
} else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
conf.getenv("SPARK_LOCAL_DIRS").split(",")
} else if (conf.getenv("MESOS_SANDBOX") != null && !shuffleServiceEnabled) {
// Mesos already creates a directory per Mesos task. Spark should use that directory
// instead so all temporary files are automatically cleaned up when the Mesos task ends.
// Note that we don't want this if the shuffle service is enabled because we want to
// continue to serve shuffle files after the executors that wrote them have already exited.
Array(conf.getenv("MESOS_SANDBOX"))
} else {
if (conf.getenv("MESOS_SANDBOX") != null && shuffleServiceEnabled) {
logInfo("MESOS_SANDBOX available but not using provided Mesos sandbox because " +
s"${config.SHUFFLE_SERVICE_ENABLED.key} is enabled.")
}
// In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
// configuration to point to a secure directory. So create a subdirectory with restricted
// permissions under each listed directory.
conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
}
}
This is interesting because it shows the order of precedence of each config setting. The order is:
1. If running on YARN, getYarnLocalDirs gives you your local dirs, which come from the LOCAL_DIRS environment variable.
2. If SPARK_EXECUTOR_DIRS is set, it's going to be one of those.
3. If SPARK_LOCAL_DIRS is set, it's going to be one of those.
4. If MESOS_SANDBOX is set and the shuffle service is not enabled, it's going to be MESOS_SANDBOX.
5. If spark.local.dir is set, it's going to be that.
6. Otherwise (catch-all), it's going to be java.io.tmpdir.
IMPORTANT: In case you're using Kubernetes, all of this is disregarded and this logic is used.
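The precedence order above can be restated as a small, simplified Python sketch. This is not Spark code: the function name `configured_local_dirs` and its two boolean flags are assumptions standing in for isRunningInYarnContainer(conf) and the shuffle service setting, tempfile.gettempdir() stands in for java.io.tmpdir, and the real method additionally randomizes the order of the YARN directories.

```python
import os
import tempfile


def configured_local_dirs(env, conf, in_yarn_container=False,
                          shuffle_service_enabled=False):
    """Simplified mirror of the branch order in getConfiguredLocalDirs.

    env  -- dict of environment variables, e.g. dict(os.environ)
    conf -- dict of Spark properties, e.g. {"spark.local.dir": "/tmp"}
    """
    if in_yarn_container:
        # On YARN the container environment is expected to carry LOCAL_DIRS;
        # this sketch raises if it is missing.
        local_dirs = env.get("LOCAL_DIRS", "")
        if not local_dirs:
            raise ValueError("LOCAL_DIRS is empty; expected YARN to set it")
        return local_dirs.split(",")
    if "SPARK_EXECUTOR_DIRS" in env:
        # The Scala code splits on File.pathSeparator; os.pathsep is the analogue.
        return env["SPARK_EXECUTOR_DIRS"].split(os.pathsep)
    if "SPARK_LOCAL_DIRS" in env:
        return env["SPARK_LOCAL_DIRS"].split(",")
    if "MESOS_SANDBOX" in env and not shuffle_service_enabled:
        return [env["MESOS_SANDBOX"]]
    # Catch-all: spark.local.dir if set, else the platform temp directory.
    return conf.get("spark.local.dir", tempfile.gettempdir()).split(",")
```

Note how a spark.local.dir passed with --conf only matters in the last branch, which is consistent with the custom value in the question having no visible effect on YARN.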
Now, how do we find this directory?
Luckily, there is a nicely placed logging line in DiskBlockManager.createLocalDirs that prints out this directory if your logging level is INFO.
So, set your default logging level to INFO in log4j.properties, restart your Spark application, and you should get a line saying something like:
Created local directory at YOUR-DIR-HERE

Spark SQL persistent view over jdbc data source

I want to create a persistent (global) view in spark sql that gets data from an underlying jdbc database connection. It works fine when I use a temporary (session-scoped) view as shown below but fails when trying to create a regular (persistent and global) view.
I don't understand why the latter should not work but couldn't find any docs/hints as all examples are always done with temporary views. Technically, I cannot see why it shouldn't work as the data is properly retrieved from jdbc source in the temporary view and thus it should not matter if I wanted to "store" the query in a persistent view so that whenever calling the view it would retrieve data directly from jdbc source.
Config.
tbl_in = 'myjdbctable'
tbl_out = 'myview'
db_user = 'myuser'
db_pw = 'mypw'
jdbc_url = 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
This works.
query = f"""
create or replace temporary view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> DataFrame[]
This does not work.
query = f"""
create or replace view {tbl_out}
using jdbc
options(
dbtable '{tbl_in}',
user '{db_user}',
password '{db_pw}',
url '{jdbc_url}'
)
"""
spark.sql(query)
> ParseException:
Error.
ParseException:
mismatched input 'using' expecting {'(', 'UP_TO_DATE', 'AS', 'COMMENT', 'PARTITIONED', 'TBLPROPERTIES'}(line 3, pos 0)
== SQL ==
create or replace view myview
using jdbc
^^^
options(
dbtable 'myjdbctable',
user 'myuser',
password '[REDACTED]',
url 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
)
TL;DR: A Spark SQL table over a JDBC source behaves like a view and so can be used like one.
It seems my assumptions about JDBC tables in Spark SQL were flawed. It turns out that a SQL table with a JDBC source (i.e. created via using jdbc) is actually a live query against the JDBC source (and not a one-off JDBC query during table creation, as I had assumed). In my mind it then behaves like a view. That means that if the underlying JDBC source changes (e.g. new entries in a column), this is reflected in the Spark SQL table on read (e.g. select from) without having to re-create the table.
It follows that the Spark SQL table over a JDBC source satisfies my requirement of having an always up-to-date reflection of the underlying table/SQL object in the JDBC source. Usually, I would use a view for that. Maybe this is why there is no persistent view over a JDBC source but only temporary views (which of course still make sense, as they are session-scoped). It should be noted that the way the Spark SQL JDBC table behaves like a view may be surprising; in particular:
if you add a column in underlying jdbc table, it will not show up in spark sql table
if you remove a column from underlying jdbc table, an error will occur when spark sql table is accessed (assuming the removed column was present during spark sql table creation)
if you remove the underlying jdbc table, an error will occur when spark sql table is accessed
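As an illustration of the "table instead of view" approach described above, here is a hedged sketch that builds the CREATE TABLE ... USING jdbc statement from the question's config values. The helper name `jdbc_table_ddl` is hypothetical, and the values are interpolated without any escaping, so it is only meant for trusted configuration.

```python
def jdbc_table_ddl(table_name, dbtable, user, password, url):
    """Build a CREATE TABLE ... USING jdbc statement to pass to spark.sql().

    Values are inserted verbatim (no escaping); use trusted inputs only.
    """
    return (
        f"CREATE TABLE {table_name}\n"
        "USING jdbc\n"
        "OPTIONS (\n"
        f"  dbtable '{dbtable}',\n"
        f"  user '{user}',\n"
        f"  password '{password}',\n"
        f"  url '{url}'\n"
        ")"
    )


ddl = jdbc_table_ddl(
    table_name="myview",
    dbtable="myjdbctable",
    user="myuser",
    password="mypw",
    url="jdbc:sqlserver://myserver.domain:1433;database=mydb",
)
# With a running session this would be executed as: spark.sql(ddl)
```

The only difference from the failing statement in the question is `CREATE TABLE` in place of `create or replace view`; the `using jdbc` and `options(...)` clauses stay the same.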
The input of spark.sql should be DML (Data Manipulation Language). Its output is a dataframe.
In terms of best practices, you should avoid using DDL (Data Definition Language) with spark.sql. Even if some statements work, it is not meant to be used this way.
If you want to use DDL, simply connect to your DB using Python packages.
If you want to create a temp view in Spark, do it using the Spark method createTempView.

Hide a spark property from displaying in the spark web UI without implementing a security filter

The application web UI at http://<driver>:4040 lists Spark properties in the “Environment” tab. All values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. However, for security reasons, I do not want my Cassandra password to be displayed in the web UI. Is there some sort of switch to ensure that certain Spark properties are not displayed?
Please note, I see some solutions that suggest implementing a security filter and using spark.ui.filters setting to refer to the class. I am hoping to avoid this complexity.
I don't think there is a general way to hide a custom property from the Spark web UI in earlier releases.
I assume you are using Spark 2.0 or below (I have not seen the feature described below in 2.0), because 2.0.1 supports masking passwords as "*****".
See issue SPARK-16796, "Visible passwords on Spark environment page".
Looking into the Apache Spark source code, we can see how a property gets "hidden" in the Spark web UI.
SparkUI
By default, the Environment tab is attached during initialization: attachTab(new EnvironmentTab(this)) [line 71]
EnvironmentPage renders the properties as a tab in the web UI:
def render(request: HttpServletRequest): Seq[Node] = {
val runtimeInformationTable = UIUtils.listingTable(
propertyHeader, jvmRow, listener.jvmInformation, fixedWidth = true)
val sparkPropertiesTable = UIUtils.listingTable(
propertyHeader, propertyRow, listener.sparkProperties.map(removePass), fixedWidth = true)
val systemPropertiesTable = UIUtils.listingTable(
propertyHeader, propertyRow, listener.systemProperties, fixedWidth = true)
val classpathEntriesTable = UIUtils.listingTable(
classPathHeaders, classPathRow, listener.classpathEntries, fixedWidth = true)
val content =
<span>
<h4>Runtime Information</h4> {runtimeInformationTable}
<h4>Spark Properties</h4> {sparkPropertiesTable}
<h4>System Properties</h4> {systemPropertiesTable}
<h4>Classpath Entries</h4> {classpathEntriesTable}
</span>
UIUtils.headerSparkPage("Environment", content, parent)
}
All properties are rendered without any masking except sparkProperties, which go through removePass:
private def removePass(kv: (String, String)): (String, String) = {
if (kv._1.toLowerCase.contains("password")) (kv._1, "******") else kv
}
As we can see, the value of every key that contains "password" is masked (BTW: in the master branch, keys containing the keyword "secret" are filtered as well, if you are interested).
I cannot test it right now, but you can try updating Spark. Then, for example, mergeDefaultSparkProperties() in SparkSubmitArguments.scala will treat spark.cassandra.auth.password as a Spark property and populate it into sparkProperties (with removePass preprocessing).
In the end, this property should show up as ****** in the Environment tab of the web UI.
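For readers more at home in Python, the masking rule of removePass can be sketched as follows. This is an illustrative port rather than Spark code; the `sensitive_words` parameter is an assumption added here to show how a "secret" keyword could be included as well.

```python
def remove_pass(key, value, sensitive_words=("password",)):
    """Mask the value if the key contains a sensitive word (case-insensitive)."""
    if any(word in key.lower() for word in sensitive_words):
        return key, "******"
    return key, value


properties = [
    ("spark.app.name", "demo"),
    ("spark.cassandra.auth.password", "s3cret"),
]
# Apply the rule to every property before it is displayed.
masked = [remove_pass(k, v) for k, v in properties]
```

The rule depends only on the key name, so a credential stored under a key without "password" in it would still be shown in clear text.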

Logging Spark Configuration Properties

I'm trying to log the properties of each Spark application that runs in a YARN cluster (properties like spark.shuffle.compress, spark.reducer.maxMbInFlight, spark.executor.instances and so on).
However, I don't know whether this information is logged anywhere. I know that we can access the YARN logs through the "yarn" command, but the properties I'm talking about are not stored there.
Is there any way to access this kind of info? The idea is to have a trace of all the applications that run in the cluster, together with their properties, to identify which ones have the most impact on execution time.
You could log it yourself... use sc.getConf.toDebugString, sqlContext.getConf("") or sqlContext.getAllConfs.
scala> sqlContext.getConf("spark.sql.shuffle.partitions")
res129: String = 200
scala> sqlContext.getAllConfs
res130: scala.collection.immutable.Map[String,String] = Map(hive.server2.thrift.http.cookie.is.httponly -> true, dfs.namenode.resource.check.interval ....
scala> sc.getConf.toDebugString
res132: String =
spark.app.id=local-1449607289874
spark.app.name=Spark shell
spark.driver.host=10.5.10.153
Edit: However, I could not find the properties you specified among the 1200+ properties in sqlContext.getAllConfs :( Otherwise the documentation says:
The application web UI at http://<driver>:4040 lists Spark properties
in the “Environment” tab. This is a useful place to check to make sure
that your properties have been set correctly. Note that only values
explicitly specified through spark-defaults.conf, SparkConf, or the
command line will appear. For all other configuration properties, you
can assume the default value is used.
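A PySpark take on "log it yourself" could look like the sketch below. The helper names `format_conf` and `log_conf` and the logger name are hypothetical; the only Spark-specific assumption is that the explicitly set properties arrive as a list of (key, value) tuples, which is what SparkConf.getAll() returns in PySpark.

```python
import logging

logger = logging.getLogger("spark_conf_audit")  # hypothetical logger name


def format_conf(pairs):
    """Render (key, value) pairs as sorted key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(pairs))


def log_conf(app_id, pairs):
    """Write one log record holding all explicitly set properties of an application."""
    logger.info("Spark properties for %s:\n%s", app_id, format_conf(pairs))


# Usage at the start of a job, assuming `sc` is the active SparkContext
# (not executed here):
#   log_conf(sc.applicationId, sc.getConf().getAll())
```

As the quoted documentation notes, properties left at their defaults will not appear in this output.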

Is it possible to get the current spark context settings in PySpark?

I'm trying to get the path to spark.worker.dir for the current sparkcontext.
If I explicitly set it as a config param, I can read it back out of SparkConf, but is there any way to access the complete config (including all defaults) using PySpark?
Spark 2.1+
spark.sparkContext.getConf().getAll(), where spark is your SparkSession (gives you a list of (key, value) tuples with all configured settings)
Yes: sc.getConf().getAll()
Which uses the method:
SparkConf.getAll()
as accessed by
SparkContext.sc.getConf()
See it in action:
In [4]: sc.getConf().getAll()
Out[4]:
[(u'spark.master', u'local'),
(u'spark.rdd.compress', u'True'),
(u'spark.serializer.objectStreamReset', u'100'),
(u'spark.app.name', u'PySparkShell')]
update configuration in Spark 2.3.1
To change the default spark configurations you can follow these steps:
Import the required classes
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
Get the default configurations
spark.sparkContext._conf.getAll()
Update the default configurations
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.app.name', 'Spark Updated Conf'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory','4g')])
Stop the current Spark Session
spark.sparkContext.stop()
Create a Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
Spark 1.6+
sc.getConf.getAll.foreach(println)
For a complete overview of your Spark environment and configuration I found the following code snippets useful:
SparkContext:
for item in sorted(sc._conf.getAll()): print(item)
Hadoop Configuration:
hadoopConf = {}
iterator = sc._jsc.hadoopConfiguration().iterator()
while iterator.hasNext():
prop = iterator.next()
hadoopConf[prop.getKey()] = prop.getValue()
for item in sorted(hadoopConf.items()): print(item)
Environment variables:
import os
for item in sorted(os.environ.items()): print(item)
Simply running
sc.getConf().getAll()
should give you a list with all settings.
Unfortunately, no, the Spark platform as of version 2.3.1 does not provide any way to programmatically access the value of every property at run time. It provides several methods to access the values of properties that were explicitly set through a configuration file (like spark-defaults.conf), set through the SparkConf object when you created the session, or set through the command line when you submitted the job, but none of these methods will show the default value for a property that was not explicitly set. For completeness, the best options are:
The Spark application’s web UI, usually at http://<driver>:4040, has an “Environment” tab with a property value table.
The SparkContext keeps a hidden reference to its configuration in PySpark, and the configuration provides a getAll method: spark.sparkContext._conf.getAll().
Spark SQL provides the SET command that will return a table of property values: spark.sql("SET").toPandas(). You can also use SET -v to include a column with the property’s description.
(These three methods all return the same data on my cluster.)
For Spark 2+ in Scala you can also use:
spark.conf.getAll; //spark as spark session
You can use:
sc.sparkContext.getConf.getAll
For example, I often have the following at the top of my Spark programs:
logger.info(sc.sparkContext.getConf.getAll.mkString("\n"))
Just for the record, the analogous Java version:
Tuple2<String, String> sc[] = sparkConf.getAll();
for (int i = 0; i < sc.length; i++) {
System.out.println(sc[i]);
}
Suppose I want to increase the driver memory in runtime using Spark Session:
s2 = SparkSession.builder.config("spark.driver.memory", "29g").getOrCreate()
Now I want to view the updated settings:
s2.conf.get("spark.driver.memory")
To get all the settings, you can make use of spark.sparkContext._conf.getAll()
Hope this helps
Not sure if you can get all the default settings easily, but specifically for the worker dir, it's quite straightforward:
from pyspark import SparkFiles
print(SparkFiles.getRootDirectory())
If you want to see the configuration in Databricks, use the command below:
spark.sparkContext._conf.getAll()
I would suggest you try the method below in order to get the current spark context settings.
SparkConf.getAll()
as accessed by
SparkContext.sc._conf
Get the default configurations specifically for Spark 2.1+
spark.sparkContext.getConf().getAll()
Stop the current Spark Session
spark.sparkContext.stop()
Create a Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
