Hide a spark property from displaying in the spark web UI without implementing a security filter - apache-spark

The application web UI at http://<driver>:4040 lists Spark properties in the “Environment” tab. All values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. However, for security reasons, I do not want my Cassandra password to be displayed in the web UI. Is there some sort of switch to ensure that certain Spark properties are not displayed?
Please note, I have seen some solutions that suggest implementing a security filter and using the spark.ui.filters setting to refer to the class. I am hoping to avoid this complexity.

I don't think there is a general way to hide a custom property from the Spark web UI in earlier releases.
I assume you are using Spark 2.0 or below (I have not seen the feature described below in 2.0), because 2.0.1 masks passwords as "*****".
See SPARK-16796 (Visible passwords on Spark environment page).
If we take a look at the Apache Spark source code and do some investigation, we can see how a property gets "hidden" in the Spark web UI.
SparkUI
By default the Environment page is attached during initialization with attachTab(new EnvironmentTab(this)) [line 71].
EnvironmentPage renders the properties for that tab in the web UI as follows:
def render(request: HttpServletRequest): Seq[Node] = {
  val runtimeInformationTable = UIUtils.listingTable(
    propertyHeader, jvmRow, listener.jvmInformation, fixedWidth = true)
  val sparkPropertiesTable = UIUtils.listingTable(
    propertyHeader, propertyRow, listener.sparkProperties.map(removePass), fixedWidth = true)
  val systemPropertiesTable = UIUtils.listingTable(
    propertyHeader, propertyRow, listener.systemProperties, fixedWidth = true)
  val classpathEntriesTable = UIUtils.listingTable(
    classPathHeaders, classPathRow, listener.classpathEntries, fixedWidth = true)
  val content =
    <span>
      <h4>Runtime Information</h4> {runtimeInformationTable}
      <h4>Spark Properties</h4> {sparkPropertiesTable}
      <h4>System Properties</h4> {systemPropertiesTable}
      <h4>Classpath Entries</h4> {classpathEntriesTable}
    </span>
  UIUtils.headerSparkPage("Environment", content, parent)
}
All properties are rendered without any hiding preprocessing except sparkProperties, which goes through removePass:
private def removePass(kv: (String, String)): (String, String) = {
  if (kv._1.toLowerCase.contains("password")) (kv._1, "******") else kv
}
As you can see, every key that contains "password" is masked (by the way, in the master branch keys containing "secret" are also filtered; check it out if you are interested).
I cannot test it now, but you could try to patch Spark so that, e.g., SparkSubmitArguments.scala in mergeDefaultSparkProperties() treats spark.cassandra.auth.password as a Spark property and populates it into sparkProperties (which get the removePass preprocessing).
At the end of the day, the property should then show up as ****** in the Environment tab of the web UI.
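If upgrading Spark is an option, note that the masking above keys off the property name itself: because spark.cassandra.auth.password contains "password", it should already be rendered as ****** on 2.0.1+ without any patching. A minimal PySpark sketch of that assumption (the value is obviously made up):
from pyspark.sql import SparkSession
# The key contains "password", so on Spark 2.0.1+ the Environment tab
# should show its value masked (per removePass above).
spark = (SparkSession.builder
         .appName("mask-demo")
         .config("spark.cassandra.auth.password", "not-a-real-password")
         .getOrCreate())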

Related

What is the purpose of global temporary views?

Trying to understand how to use the Spark Global Temporary Views.
In one spark-shell session I've created a view
spark = SparkSession.builder.appName('spark_sql').getOrCreate()
df = (
    spark.read.option("header", "true")
    .option("delimiter", ",")
    .option("inferSchema", "true")
    .csv("/user/root/data/cars.csv"))
df.createGlobalTempView("my_cars")
# works without any problem
spark.sql("SELECT * FROM global_temp.my_cars").show()
And on another I tried to access it, without success (table or view not found).
#second Spark Shell
spark = SparkSession.builder.appName('spark_sql').getOrCreate()
spark.sql("SELECT * FROM global_temp.my_cars").show()
That's the error I receive:
pyspark.sql.utils.AnalysisException: u"Table or view not found: `global_temp`.`my_cars`; line 1 pos 14;\n'Project [*]\n+- 'UnresolvedRelation `global_temp`.`my_cars`\n"
I've read that each spark-shell has its own context, and that's why one spark-shell cannot see the other. So I don't understand: what is the use of a global temporary view, and where is it useful?
Thanks
In the Spark documentation you can see:
If you want to have a temporary view that is shared among all sessions
and keep alive until the Spark application terminates, you can create
a global temporary view.
The global temporary view remains accessible as long as the application is alive.
Opening a new shell, even with the same application name, just creates a new application.
You can try and test it within the same shell:
spark.newSession().sql("SELECT * FROM global_temp.my_cars").show()
Please see my answer on a similar question for a more detailed example, as well as a short definition of a Spark Application and a Spark Session.
Temporary views in Spark SQL are session-scoped and will disappear if the session that creates them terminates. If you want to have a temporary view that is shared among all sessions and kept alive until the Spark application terminates, you can create a global temporary view. A global temporary view is tied to a system-preserved database global_temp, and we must use the qualified name to refer to it:
df.createGlobalTempView("people")
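To make the "same application, different session" point concrete, here is a minimal self-contained PySpark sketch (the view name and data are made up for illustration):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("gtv_demo").getOrCreate()
# Register a global temporary view in the current session.
spark.range(5).createGlobalTempView("numbers")
# A different session of the same application can still read it
# through the global_temp database.
spark.newSession().sql("SELECT * FROM global_temp.numbers").show()
# A separate spark-shell or spark-submit run is a separate application,
# so the view is not visible there.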

My Accumulable is not shown in the Spark Web UI

I created a case class A and try to add a new Accumulable to Spark:
sc.accumulable(A(), "My accumulable description")
Now when Spark is running I cannot see my Accumulable[A, B] in the Spark Web UI. What could be wrong?
One possible reason is that you did not implement the equals method for A correctly. Spark relies on a correct implementation of equals, at least when it tries to find out whether your Accumulable was updated. An example of the comparison can be found in DAGScheduler.scala, in updateAccumulators, on this line of code:
if (acc.name.isDefined && partialValue != acc.zero) {
where partialValue is the updated value of the Accumulable and acc.zero is the initial value of the same Accumulable.

Logging Spark Configuration Properties

I'm trying to log the properties of each Spark application that runs on a YARN cluster (properties like spark.shuffle.compress, spark.reducer.maxMbInFlight, spark.executor.instances, and so on).
However, I don't know whether this information is logged anywhere. I know that we can access the YARN logs through the "yarn" command, but the properties I'm talking about are not stored there.
Is there any way to access this kind of information? The idea is to have a trace of all the applications that run on the cluster, together with their properties, to identify which ones have the most impact on execution time.
You could log it yourself... use sc.getConf.toDebugString, sqlContext.getConf("") or sqlContext.getAllConfs.
scala> sqlContext.getConf("spark.sql.shuffle.partitions")
res129: String = 200
scala> sqlContext.getAllConfs
res130: scala.collection.immutable.Map[String,String] = Map(hive.server2.thrift.http.cookie.is.httponly -> true, dfs.namenode.resource.check.interval ....
scala> sc.getConf.toDebugString
res132: String =
spark.app.id=local-1449607289874
spark.app.name=Spark shell
spark.driver.host=10.5.10.153
Edit: However, I could not find the properties you specified among the 1200+ properties in sqlContext.getAllConfs :( Otherwise the documentation says:
The application web UI at http://<driver>:4040 lists Spark properties
in the “Environment” tab. This is a useful place to check to make sure
that your properties have been set correctly. Note that only values
explicitly specified through spark-defaults.conf, SparkConf, or the
command line will appear. For all other configuration properties, you
can assume the default value is used.
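If you want these properties in your application's own logs for every run (so they sit next to the YARN logs), a minimal PySpark sketch, assuming a SparkSession named spark and Python's standard logging module, could look like this:
import logging
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("spark-conf")
# Logs every explicitly set Spark property for this application;
# properties left at their defaults will not appear here.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    log.info("%s=%s", key, value)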

Is it possible to get the current spark context settings in PySpark?

I'm trying to get the path to spark.worker.dir for the current SparkContext.
If I explicitly set it as a config param, I can read it back out of SparkConf, but is there any way to access the complete config (including all defaults) using PySpark?
Spark 2.1+
spark.sparkContext.getConf().getAll(), where spark is your SparkSession (this gives you a list of (key, value) pairs with all configured settings).
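If you prefer a dict, you can wrap those pairs yourself; a quick sketch:
# Wrap the (key, value) pairs in a dict for convenient lookups.
conf_dict = dict(spark.sparkContext.getConf().getAll())
print(conf_dict.get("spark.app.name"))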
Yes: sc.getConf().getAll()
Which uses the method:
SparkConf.getAll()
as accessed by
SparkContext.sc.getConf()
See it in action:
In [4]: sc.getConf().getAll()
Out[4]:
[(u'spark.master', u'local'),
(u'spark.rdd.compress', u'True'),
(u'spark.serializer.objectStreamReset', u'100'),
(u'spark.app.name', u'PySparkShell')]
update configuration in Spark 2.3.1
To change the default spark configurations you can follow these steps:
Import the required classes
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
Get the default configurations
spark.sparkContext._conf.getAll()
Update the default configurations
conf = spark.sparkContext._conf.setAll([
    ('spark.executor.memory', '4g'),
    ('spark.app.name', 'Spark Updated Conf'),
    ('spark.executor.cores', '4'),
    ('spark.cores.max', '4'),
    ('spark.driver.memory', '4g')])
Stop the current Spark Session
spark.sparkContext.stop()
Create a Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
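After recreating the session, you can sanity-check that the update took effect; a small sketch using the keys set in the steps above:
# Should print '4g' and 'Spark Updated Conf' if the new conf was applied.
print(spark.sparkContext.getConf().get('spark.executor.memory'))
print(spark.conf.get('spark.app.name'))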
Spark 1.6+
sc.getConf.getAll.foreach(println)
For a complete overview of your Spark environment and configuration I found the following code snippets useful:
SparkContext:
for item in sorted(sc._conf.getAll()): print(item)
Hadoop Configuration:
hadoopConf = {}
iterator = sc._jsc.hadoopConfiguration().iterator()
while iterator.hasNext():
    prop = iterator.next()
    hadoopConf[prop.getKey()] = prop.getValue()
for item in sorted(hadoopConf.items()): print(item)
Environment variables:
import os
for item in sorted(os.environ.items()): print(item)
Simply running
sc.getConf().getAll()
should give you a list with all settings.
Unfortunately, no, the Spark platform as of version 2.3.1 does not provide any way to programmatically access the value of every property at run time. It provides several methods to access the values of properties that were explicitly set through a configuration file (like spark-defaults.conf), set through the SparkConf object when you created the session, or set through the command line when you submitted the job, but none of these methods will show the default value for a property that was not explicitly set. For completeness, the best options are:
The Spark application’s web UI, usually at http://<driver>:4040, has an “Environment” tab with a property value table.
The SparkContext keeps a hidden reference to its configuration in PySpark, and the configuration provides a getAll method: spark.sparkContext._conf.getAll().
Spark SQL provides the SET command that will return a table of property values: spark.sql("SET").toPandas(). You can also use SET -v to include a column with the property’s description.
(These three methods all return the same data on my cluster.)
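A short usage sketch of the SET option from that list (assuming a SparkSession named spark):
# Properties that were explicitly set (config files, SparkConf, command line):
spark.sql("SET").show(truncate=False)
# The same, plus a description column for each property:
spark.sql("SET -v").show(truncate=False)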
For Spark 2+, when using Scala, you can also use:
spark.conf.getAll // spark is the SparkSession
You can use:
sc.sparkContext.getConf.getAll
For example, I often have the following at the top of my Spark programs:
logger.info(sc.sparkContext.getConf.getAll.mkString("\n"))
Just for the record, the analogous Java version:
Tuple2<String, String>[] entries = sparkConf.getAll();
for (int i = 0; i < entries.length; i++) {
    System.out.println(entries[i]);
}
Suppose I want to increase the driver memory at runtime using the Spark session:
s2 = SparkSession.builder.config("spark.driver.memory", "29g").getOrCreate()
Now I want to view the updated settings:
s2.conf.get("spark.driver.memory")
To get all the settings, you can make use of spark.sparkContext._conf.getAll()
Hope this helps
Not sure if you can get all the default settings easily, but specifically for the worker dir, it's quite straightforward:
from pyspark import SparkFiles
print(SparkFiles.getRootDirectory())
If you want to see the configuration in Databricks, use the command below:
spark.sparkContext._conf.getAll()
I would suggest you try the method below in order to get the current Spark context settings.
SparkConf.getAll()
as accessed by
SparkContext.sc._conf
Get the default configurations, specifically for Spark 2.1+:
spark.sparkContext.getConf().getAll()
Stop the current Spark session:
spark.sparkContext.stop()
Create a new Spark session (here conf is an updated SparkConf, e.g. built as in the Spark 2.3.1 steps above):
spark = SparkSession.builder.config(conf=conf).getOrCreate()

Spark accumulator not displayed in spark WebUI

I'm using Spark Streaming. According to the Spark Programming Guide (see http://spark.apache.org/docs/latest/programming-guide.html#accumulators), named accumulators will be displayed in the web UI (the guide shows a screenshot of this).
Unfortunately, I cannot find this anywhere. I am registering the accumulators like this (Java):
LongAccumulator accumulator = new LongAccumulator();
ssc.sparkContext.sc().register(accumulator, "my accumulator");
I am using Spark 2.0.0.
I do not have a working streaming example, but in a non-streaming example this UI can be found on the Stages tab when choosing a specific stage.
Also, I generally create the accumulator like this:
val accum = sc.longAccumulator("My Accumulator")
The equivalent for Spark Streaming would probably be to replace sc with ssc.sparkContext.
It worked for me. Below is my sample code:
Accumulator<Integer> spansWritten = jsc.sparkContext().intAccumulator(0, "Spans_Written");
JavaDStream<Span> dStream = SourceFactory.getSource().createStream(jsc)
    .map(s -> {
        spansWritten.add(1);
        return s;
    });
However, when I tried to use them inside a Decoder while creating the stream for Kafka, they didn't show up in the UI.
Here is how it looks in the UI (select the Stages tab at the top, and click on one of the stages):
[screenshot]
Make sure to register the accumulator with your Spark context object:
LongAccumulator accumulator = new LongAccumulator();
ssc.sparkContext().register(accumulator, "my accumulator");
