Spark accumulator not displayed in spark WebUI - apache-spark

I'm using Spark Streaming. According to the Spark Programming Guide (see http://spark.apache.org/docs/latest/programming-guide.html#accumulators), named accumulators will be displayed in the web UI, as shown in the guide's screenshot.
Unfortunately, I cannot find this anywhere. I am registering the accumulators like this (Java):
LongAccumulator accumulator = new LongAccumulator();
ssc.sparkContext().sc().register(accumulator, "my accumulator");
I am using Spark 2.0.0.

I do not have a working streaming example, but in a non-streaming example the accumulator can be found on the Stages tab of the UI when you select a specific stage.
Also, I generally create the accumulator like this:
val accum = sc.longAccumulator("My Accumulator")
The equivalent for Spark Streaming would probably be to replace sc with ssc.sparkContext.
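For example, a minimal Scala sketch of that idea (the socket source, host, and port are placeholders, and the counting logic is purely illustrative):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("AccumulatorDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

// Named accumulator created via the StreamingContext's underlying SparkContext;
// longAccumulator(name) also registers it, so it should appear on the stage detail pages.
val accum = ssc.sparkContext.longAccumulator("My Accumulator")

val lines = ssc.socketTextStream("localhost", 9999)
lines.foreachRDD { rdd =>
  rdd.foreach(_ => accum.add(1)) // updates are reported once the action runs tasks
}

ssc.start()
ssc.awaitTermination()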

It worked for me. Below is my sample code:
Accumulator<Integer> spansWritten = jsc.sparkContext().intAccumulator(0, "Spans_Written");
JavaDStream<Span> dStream = SourceFactory.getSource().createStream(jsc)
    .map(s -> {
        spansWritten.add(1);
        return s;
    });
However, when I tried to use an accumulator inside a Decoder while creating the Kafka stream, it did not show up in the UI.
Here is how it looks in the UI (select the Stages tab at the top and click on one of the stages):
(screenshot omitted)

Make sure to register the accumulator with your SparkContext:
LongAccumulator accumulator = new LongAccumulator();
ssc.sparkContext().sc().register(accumulator, "my accumulator");

Related

How to work with temporary tables in foreachBatch?

We are building a streaming platform where it is essential to work with SQL in batches.
val query = streamingDataSet.writeStream.option("checkpointLocation", checkPointLocation).foreachBatch { (df, batchId) => {
df.createOrReplaceTempView("events")
val df1 = ExecutionContext.getSparkSession.sql("select * from events")
df1.limit(5).show()
// More complex processing on dataframes
}}.trigger(trigger).outputMode(outputMode).start()
query.awaitTermination()
The error thrown is:
org.apache.spark.sql.streaming.StreamingQueryException: Table or view not found: events
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'events' not found in database 'default';
The streaming source is Kafka with watermarking, and without using Spark SQL we are able to execute DataFrame transformations. The Spark version is 2.4.0 and Scala is 2.11.7. The trigger is ProcessingTime every 1 minute and the output mode is Append.
Is there any other approach that would let us use Spark SQL within foreachBatch? Would it work with an upgraded version of Spark, and if so, to which version should we upgrade?
Kindly help. Thank you.
tl;dr Replace ExecutionContext.getSparkSession with df.sparkSession.
The reason for the StreamingQueryException is that the streaming query tries to access the events temporary table in a SparkSession that knows nothing about it, i.e. ExecutionContext.getSparkSession.
The only SparkSession that has the events temporary table registered is the SparkSession the df DataFrame was created with, i.e. df.sparkSession.
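Applied to the snippet from the question, that change would look roughly like this (streamingDataSet, checkPointLocation, trigger, and outputMode are the question's own placeholders):
val query = streamingDataSet.writeStream
  .option("checkpointLocation", checkPointLocation)
  .foreachBatch { (df, batchId) =>
    df.createOrReplaceTempView("events")
    // Use the SparkSession the batch DataFrame belongs to, not some other session.
    val df1 = df.sparkSession.sql("select * from events")
    df1.limit(5).show()
    // More complex processing on dataframes
  }
  .trigger(trigger)
  .outputMode(outputMode)
  .start()

query.awaitTermination()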
Please check the code snippet below. Here, I have created two separate DataFrames, responseDF1 and responseDF2 from resultDF and shown the output in the console. responseDF2 is created using a temporary table. You can try the same.
resultDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()
  val responseDF1 = batchDF.selectExpr("ResponseObj.type", "ResponseObj.key", "ResponseObj.activity", "ResponseObj.price")
  responseDF1.show()
  responseDF1.createTempView("responseTbl1")
  val responseDF2 = batchDF.sparkSession.sql("select activity, key from responseTbl1")
  responseDF2.show()
  batchDF.sparkSession.catalog.dropTempView("responseTbl1")
  batchDF.unpersist()
  ()
}.start().awaitTermination()

Hide a spark property from displaying in the spark web UI without implementing a security filter

The application web UI at http://<driver>:4040 lists Spark properties in the “Environment” tab. All values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. However, for security reasons, I do not want my Cassandra password to be displayed in the web UI. Is there some sort of switch to ensure that certain Spark properties are not displayed?
Please note, I see some solutions that suggest implementing a security filter and using spark.ui.filters setting to refer to the class. I am hoping to avoid this complexity.
I don't think there is a general way to hide a custom property from the Spark web UI in earlier releases.
I assume you are using Spark 2.0 or below (I have not seen the feature described below in 2.0), because 2.0.1 already preprocesses passwords into "*****".
Check the issue SPARK-16796 (Visible passwords on Spark environment page).
If we take a look into the Apache Spark source code, we can see how a property gets "hidden" in the Spark web UI.
In SparkUI, the Environment page is attached during initialization via attachTab(new EnvironmentTab(this)) (line 71). EnvironmentPage then renders the properties for that tab in the web UI as follows:
def render(request: HttpServletRequest): Seq[Node] = {
  val runtimeInformationTable = UIUtils.listingTable(
    propertyHeader, jvmRow, listener.jvmInformation, fixedWidth = true)
  val sparkPropertiesTable = UIUtils.listingTable(
    propertyHeader, propertyRow, listener.sparkProperties.map(removePass), fixedWidth = true)
  val systemPropertiesTable = UIUtils.listingTable(
    propertyHeader, propertyRow, listener.systemProperties, fixedWidth = true)
  val classpathEntriesTable = UIUtils.listingTable(
    classPathHeaders, classPathRow, listener.classpathEntries, fixedWidth = true)
  val content =
    <span>
      <h4>Runtime Information</h4> {runtimeInformationTable}
      <h4>Spark Properties</h4> {sparkPropertiesTable}
      <h4>System Properties</h4> {systemPropertiesTable}
      <h4>Classpath Entries</h4> {classpathEntriesTable}
    </span>

  UIUtils.headerSparkPage("Environment", content, parent)
}
All properties are rendered without any hiding preprocessing except sparkProperties, which is passed through removePass:
private def removePass(kv: (String, String)): (String, String) = {
  if (kv._1.toLowerCase.contains("password")) (kv._1, "******") else kv
}
As we can see, every key that contains "password" has its value masked (by the way, in the master branch they also filter keys containing "secret", check it out if you are interested).
I cannot test it now, but you could try patching Spark so that, for example, SparkSubmitArguments.scala in mergeDefaultSparkProperties() treats spark.cassandra.auth.password as a Spark property and populates it into sparkProperties (which gets the removePass preprocessing).
At the end of the day, the property should then show up masked as ****** on the Environment tab in the web UI.
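For illustration only, a small standalone Scala sketch (not Spark code, and the property values are made up) that mirrors the removePass logic above and shows which keys would be masked:
// Standalone illustration of the Environment page's masking rule:
// any Spark property whose key contains "password" is rendered as "******".
object MaskDemo {
  private def removePass(kv: (String, String)): (String, String) =
    if (kv._1.toLowerCase.contains("password")) (kv._1, "******") else kv

  def main(args: Array[String]): Unit = {
    val sparkProperties = Seq(
      "spark.app.name"                -> "my-app",  // left as-is
      "spark.cassandra.auth.password" -> "s3cr3t",  // masked: key contains "password"
      "spark.cassandra.auth.secret"   -> "s3cr3t"   // not masked here (only newer branches also filter "secret")
    )
    sparkProperties.map(removePass).foreach(println)
  }
}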

My Accumulable is not shown in the Spark Web UI

I created a case class A and try to add a new Accumulable to Spark:
sc.accumulable(A(), "My accumulable description")
Now when Spark is running I cannot see my Accumulable[A, B] in the Spark Web UI. What could be wrong?
One possible reason is that you did not implement the equals method for A correctly. Spark relies on a correct implementation of equals, at least when it tries to find out whether your Accumulable was updated. An example of the comparison can be found in DAGScheduler.scala:updateAccumulators, in this line of code:
if (acc.name.isDefined && partialValue != acc.zero) {
where partialValue is the updated value of the Accumulable and acc.zero is its initial value.
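A minimal, spark-shell-style Scala sketch of what a working setup could look like, using the (since-deprecated) Accumulable API; A here is a hypothetical counter-like case class, and case classes get a structural equals for free, which is exactly what the partialValue != acc.zero check relies on:
import org.apache.spark.{AccumulableParam, SparkConf, SparkContext}

// Hypothetical accumulable type; the generated case-class equals compares fields,
// so an updated value is not equal to the zero value and the check above passes.
case class A(count: Long = 0L, sum: Long = 0L)

implicit object AParam extends AccumulableParam[A, Long] {
  def addAccumulator(acc: A, value: Long): A = A(acc.count + 1, acc.sum + value)
  def addInPlace(a1: A, a2: A): A = A(a1.count + a2.count, a1.sum + a2.sum)
  def zero(initial: A): A = A()
}

val sc  = new SparkContext(new SparkConf().setAppName("accumulable-demo").setMaster("local[*]"))
val acc = sc.accumulable(A(), "My accumulable description")

sc.parallelize(1L to 100L).foreach(acc += _)  // updates are reported per stage
println(acc.value)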

MemSQL Spark Job

I am trying to read a CSV file in a Spark job using a MemSQL Extractor, do some enrichment using a Transformer, and load it into a MemSQL database using Java.
I see there is a memsql-spark interface jar, but I cannot find any useful Java API documentation or examples.
I have started writing an extractor to read from the CSV, but I don't know how to move further.
public Option<RDD<byte[]>> nextRDD(SparkContext sparkContext, UserExtractConfig config, long batchInterval, PhaseLogger logger) {
    RDD<String> inputFile = sparkContext.textFile(filePath, minPartitions);
    RDD<byte[]> bytes = inputFile.map(ByteUtils.utf8StringToBytes(filePath), String.class); // compilation error
    return bytes; // compilation error
}
I would appreciate it if someone could point me in the right direction to get started...
Thanks...
First, configure the Spark connector in Java using the following code:
SparkConf conf = new SparkConf();
conf.set("spark.datasource.singlestore.clientEndpoint", "singlestore-host");
conf.set("spark.datasource.singlestore.user", "admin");
conf.set("spark.datasource.singlestore.password", "s3cur3-pa$$word");
After running the above code, Spark will be connected to SingleStore. You can then read the CSV into a Spark DataFrame, transform and manipulate the data according to your requirements, and write the resulting DataFrame to a database table.
Also attaching a link for your reference:
spark-singlestore.
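As a rough Scala sketch of the read-transform-write flow (the "singlestore" format name and the spark.datasource.singlestore.* options follow the connector's documented usage, while the file path, derived column, and table names are placeholders I made up):
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.current_timestamp

val spark = SparkSession.builder()
  .appName("csv-to-singlestore")
  .config("spark.datasource.singlestore.clientEndpoint", "singlestore-host")
  .config("spark.datasource.singlestore.user", "admin")
  .config("spark.datasource.singlestore.password", "s3cur3-pa$$word")
  .getOrCreate()

// Extract: read the CSV into a DataFrame.
val csv = spark.read.option("header", "true").csv("/path/to/input.csv")

// Transform: enrich the data as required.
val enriched = csv.withColumn("loaded_at", current_timestamp())

// Load: write the DataFrame into a SingleStore table via the connector.
enriched.write
  .format("singlestore")
  .mode(SaveMode.Append)
  .save("mydb.mytable")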

Is it possible to get the current spark context settings in PySpark?

I'm trying to get the path to spark.worker.dir for the current SparkContext.
If I explicitly set it as a config param, I can read it back out of SparkConf, but is there any way to access the complete config (including all defaults) using PySpark?
Spark 2.1+
spark.sparkContext.getConf().getAll() where spark is your SparkSession (this gives you a list of (key, value) pairs with all configured settings)
Yes: sc.getConf().getAll()
Which uses the method:
SparkConf.getAll()
as accessed by
SparkContext.sc.getConf()
See it in action:
In [4]: sc.getConf().getAll()
Out[4]:
[(u'spark.master', u'local'),
(u'spark.rdd.compress', u'True'),
(u'spark.serializer.objectStreamReset', u'100'),
(u'spark.app.name', u'PySparkShell')]
Update the configuration in Spark 2.3.1
To change the default Spark configuration you can follow these steps:
Import the required classes
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
Get the default configurations
spark.sparkContext._conf.getAll()
Update the default configurations
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.app.name', 'Spark Updated Conf'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory','4g')])
Stop the current Spark Session
spark.sparkContext.stop()
Create a Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
Spark 1.6+
sc.getConf.getAll.foreach(println)
For a complete overview of your Spark environment and configuration I found the following code snippets useful:
SparkContext:
for item in sorted(sc._conf.getAll()): print(item)
Hadoop Configuration:
hadoopConf = {}
iterator = sc._jsc.hadoopConfiguration().iterator()
while iterator.hasNext():
    prop = iterator.next()
    hadoopConf[prop.getKey()] = prop.getValue()
for item in sorted(hadoopConf.items()): print(item)
Environment variables:
import os
for item in sorted(os.environ.items()): print(item)
Simply running
sc.getConf().getAll()
should give you a list with all settings.
Unfortunately, no, the Spark platform as of version 2.3.1 does not provide any way to programmatically access the value of every property at run time. It provides several methods to access the values of properties that were explicitly set through a configuration file (like spark-defaults.conf), set through the SparkConf object when you created the session, or set through the command line when you submitted the job, but none of these methods will show the default value for a property that was not explicitly set. For completeness, the best options are:
The Spark application’s web UI, usually at http://<driver>:4040, has an “Environment” tab with a property value table.
The SparkContext keeps a hidden reference to its configuration in PySpark, and the configuration provides a getAll method: spark.sparkContext._conf.getAll().
Spark SQL provides the SET command that will return a table of property values: spark.sql("SET").toPandas(). You can also use SET -v to include a column with the property’s description.
(These three methods all return the same data on my cluster.)
For Spark 2+, when using Scala, you can also use:
spark.conf.getAll // spark is the SparkSession
You can use:
sc.sparkContext.getConf.getAll
For example, I often have the following at the top of my Spark programs:
logger.info(sc.sparkContext.getConf.getAll.mkString("\n"))
Just for the record, the analogous Java version:
Tuple2<String, String> sc[] = sparkConf.getAll();
for (int i = 0; i < sc.length; i++) {
    System.out.println(sc[i]);
}
Suppose I want to increase the driver memory at runtime using a Spark session:
s2 = SparkSession.builder.config("spark.driver.memory", "29g").getOrCreate()
Now I want to view the updated settings:
s2.conf.get("spark.driver.memory")
To get all the settings, you can make use of spark.sparkContext._conf.getAll()
Hope this helps
Not sure if you can get all the default settings easily, but specifically for the worker dir, it's quite straightforward:
from pyspark import SparkFiles
print(SparkFiles.getRootDirectory())
If you want to see the configuration in Databricks, use the command below:
spark.sparkContext._conf.getAll()
I would suggest you try the method below in order to get the current spark context settings.
SparkConf.getAll()
as accessed by
SparkContext.sc._conf
Get the default configurations specifically for Spark 2.1+
spark.sparkContext.getConf().getAll()
Stop the current Spark Session
spark.sparkContext.stop()
Create a Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
