I'm trying to figure out why activity that I know is occurring isn't showing up in the SQL tab of the Spark UI. I am using Spark 1.6.0.
For example, we have a load of activity occurring today between 11:06 & 13:17, and I know for certain that the code being executed is using the Spark DataFrame API:
Yet if I hop over to the SQL tab I don't see any activity between those times:
So... I'm trying to figure out what influences whether or not activity appears in that SQL tab, because the information presented there is (arguably) the most useful in the whole UI - and when there's activity occurring that isn't showing up it becomes kinda annoying. The only distinguishing characteristic seems to be that the jobs showing up in the SQL tab use actions that don't write any data (e.g. count()); the jobs that do write data don't seem to show up at all. I'm puzzled as to why.
Any pearls of wisdom?
Related
I am working on a Spark SQL progress bar which, in the desired state, would display the progress of a Spark SQL query.
However, the current Spark library/Spark API endpoints are only able to show jobs in the RUNNING/FINISHED state.
Hence, I want to find out if there is any way to predict the number of jobs based on the optimized logical plan/physical plan, so that I can make an actual progress bar rather than the current showConsoleProgress one, which doesn't tell much beyond the fact that something is running.
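For reference, here is a minimal Scala sketch (assuming Spark 2.x; the query and data are made up) of how the optimized logical plan and physical plan can be inspected from a Dataset. Counting Exchange nodes is only a rough heuristic I'm assuming for shuffle/stage boundaries, not an exact prediction of the job count.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.exchange.Exchange

val spark = SparkSession.builder().appName("plan-inspection").getOrCreate()
import spark.implicits._

// Hypothetical query; any Dataset exposes its plans through queryExecution.
val df = Seq((1, "a"), (2, "b")).toDF("id", "value").groupBy($"value").count()
val qe = df.queryExecution
println(qe.optimizedPlan)  // optimized logical plan
println(qe.executedPlan)   // physical plan

// Rough heuristic (an assumption, not a Spark guarantee): each Exchange node
// is a shuffle boundary, which usually means an extra stage at runtime.
val shuffleBoundaries = qe.executedPlan.collect { case e: Exchange => e }.size
println(s"Exchange nodes in the physical plan: $shuffleBoundaries")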
I'm trying to optimize a program with Spark SQL; this program is basically a HUGE SQL query (it joins like 10 tables with many cases, etc.). I'm more used to DF-API-oriented programs, and those did show the different stages much better.
It's quite well structured and I understand it more or less. However, I have a problem: I always use the Spark UI SQL view to get hints on where to focus the optimizations.
However, in this kind of program the Spark UI SQL view shows nothing. Is there a reason for this (or a way to force it to show)?
I'm expecting to see each join/scan with the number of output rows after it and so on... but I only see a full "WholeStageCodegen" for a "Parsed logical plan" which is like 800 lines.
I can't show the code, but it has the following characteristics:
1- The action triggering it is show(20).
2- It takes about 1 hour to execute (with only a few executors).
3- There is a persist before the show/action.
4- It uses Kudu, Hive and in-memory tables (registered before this query).
5- The logical plan is around 700 lines.
Is there a way to improve the tracing there? (Maybe disabling WholeStageCodegen? See the sketch below, though that may hurt performance...)
Thanks!
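On the parenthetical about disabling WholeStageCodegen: a minimal sketch, assuming Spark 2.x (the app name is hypothetical), of the switch involved, spark.sql.codegen.wholeStage. As noted above, turning it off can cost performance, so it is best kept to debugging sessions.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("trace-friendly-run")  // hypothetical app name
  .config("spark.sql.codegen.wholeStage", "false")  // physical plans show individual operators
  .getOrCreate()

// It can also be flipped at runtime for a single investigation:
spark.conf.set("spark.sql.codegen.wholeStage", false)
spark.sql("SELECT 1").explain(true)  // no WholeStageCodegen wrapper nodes in the output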
I wanted to analyze the SQL queries executed by users through Spark, so I checked the Spark history server logs, and it seems like it only logs the information partially. For example, it logs my SELECT statements, but it doesn't log statements like CREATE or DROP, and when I do INSERT INTO TABLE SELECT....., it just logs the SELECT statement without saying which table the data was inserted into. I am wondering if there is something wrong with my log settings or if this is the correct behaviour. If it is, do you know what would be the best way to get historical data of the queries running through Spark?
Thanks
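One way to capture the full statements yourself, rather than relying on the history server, is to register a QueryExecutionListener on the live session; the analyzed plan it receives includes the insert/create nodes that the UI does not surface as SQL text. A minimal Scala sketch, assuming Spark 2.2+; the table names are hypothetical, and the listener only sees queries from the application it is registered in.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder().appName("query-audit").enableHiveSupport().getOrCreate()

spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    println(s"[$funcName] finished in ${durationNs / 1e6} ms")
    println(qe.analyzed.treeString)  // the analyzed plan names the target table of an INSERT
  }
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = {
    println(s"[$funcName] failed: ${exception.getMessage}")
  }
})

// Hypothetical tables; writer-based inserts report to the listener as well.
spark.sql("SELECT * FROM source_table").write.insertInto("target_table")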
We are trying to build a scenario where, based on some selection parameters in a reporting tool (let's say Tableau), a Spark program needs to be executed which performs some market basket analysis on a data set against the selection parameters. The result from the program then needs to be displayed in the reporting tool.
We are not able to figure out how to trigger the Spark program once the user enters the selection parameters in the reporting tool (basically the linkage between the reporting tool and the Spark program). Any pointers in this regard would help a lot.
Thanks!
This applies if you are looking for the steps to connect Spark SQL with Tableau. If you want to do any pre-processing, you have to do it on the source side.
For example, take Hive as the source with Tableau; then you have to create the view or do the data massaging on the Hive side.
If you are using Tableau Server, you can use the Tableau JavaScript API to call a function you write when the user makes a selection. The API also has functions your code can call to refresh or display a viz.
When I run a job on Apache Spark, the web UI gives a view similar to this:
While this is incredibly useful for me as a developer to see where things are, I think the line numbers in the stage description would not be quite as useful for my support team. To make their job easier, I would like to be able to provide a bespoke name for each stage of my job, as well as for the job itself, like so:
Is this something that can be done in Spark? If so, how would I do so?
That's where one of the lesser-known features of Spark Core, local properties, comes in handy.
Spark SQL uses them to group different Spark jobs under a single structured query, so you can use the SQL tab and navigate easily.
You can control local properties using SparkContext.setLocalProperty:
Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool. User-defined properties may also be set here. These properties are propagated through to worker tasks and can be accessed there via org.apache.spark.TaskContext#getLocalProperty.
The web UI uses two local properties:
callSite.short in the Jobs tab (this is exactly what you want)
callSite.long in the Job Details page
Sample Usage
scala> sc.setLocalProperty("callSite.short", "callSite.short")
scala> sc.setLocalProperty("callSite.long", "this is callSite.long")
scala> sc.parallelize(0 to 9).count
res2: Long = 10
And the result in the web UI:
Click a job to see the details where you can find the longer call site, i.e. callSite.long.
Here comes the Stages tab.
You can use the following API(s) to set and unset the stage names.
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/SparkContext.html#setCallSite-java.lang.String-
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/SparkContext.html#clearCallSite--
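A minimal sketch of how these could be used around an action (the label and input path below are hypothetical):

sc.setCallSite("Load raw events")          // shows up as the job/stage description in the UI
val events = sc.textFile("/data/events")   // hypothetical input path
events.count()
sc.clearCallSite()                         // subsequent jobs fall back to the default call site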
Also, Spark supports the concept of job groups within the application; the following API(s) can be used to set and unset the job group names.
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/SparkContext.html#setJobGroup-java.lang.String-java.lang.String-boolean-
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/SparkContext.html#clearJobGroup--
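A minimal sketch, with a hypothetical group id and description; everything submitted between setJobGroup and clearJobGroup belongs to the group and can be cancelled together:

sc.setJobGroup("nightly-etl", "Nightly ETL: build aggregates", interruptOnCancel = true)
sc.parallelize(1 to 1000000).map(_ * 2).count()
sc.clearJobGroup()
// From another thread, if needed:
// sc.cancelJobGroup("nightly-etl")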
The job description within the job group can also be configured using the following API.
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/SparkContext.html#setJobDescription-java.lang.String-
In PySpark, you could use the snippet below:
sc.setJobDescription('test job')
spark.createDataFrame([(1,)], 'Id integer').show()