Spark SQL - org.apache.spark.sql.AnalysisException - apache-spark

The error described below occurs when I run Spark job on Databricks the second time (the first less often).
The sql query just performs create table as select from registered temp view from DataFrame.
The first idea was spark.catalog.clearCache() in the end of the job (did't help).
Also I found some post on databricks forum about using object ... extends App (Scala) instead of main method (didn't help again)
P.S. current_date() is the built-in function and it should be provided automatically (expected)
Spark 2.4.4, Scala 2.11, Databricks Runtime 6.2
org.apache.spark.sql.AnalysisException: Undefined function: 'current_date'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 21 pos 4
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1318)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1318)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1317)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1309)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:76)```

Solution, ensure spark initialized every time when job is executed.
TL;DR,
I had similar issue and that object extends App solution pointed me in right direction. So, in my case I was creating spark session outside of the "main" but within object and when job was executed first time cluster/driver loaded jar and initialised spark variable and once job has finished execution successfully (first time) the jar is kept it in memory but link to spark is lost for some reason and any subsequent execution does not reinitialize spark as jar is already loaded and in my case spark initilisation was outside main and hence was not re-initilised. I think it's not an issue for Databricks jobs that create cluster and run or start cluster before execution (as these are similar to first time start case) and only related to clusters that already up and running as jars are loaded during either cluster start up or job execution.
So, I moved spark creation i.e. SparkSession.builder()...getOrCreate() to the "main" and so when job called so does spark session gets reinitialized.

current_date() is the built-in function and it should be provided
automatically (expected)
This expectation is wrong. you have to import the functions
for scala
import org.apache.spark.sql.functions._
where current_date function is available.
from pyspark.sql import functions as F
for pyspark

Related

PySpark job only partly running

I have a PySpark script that I am running locally with spark-submit in Docker. In my script I have a call toPandas() on a PySpark DataFrame and afterwards I have various manipulations of the DataFrame, finishing in a call to to_csv() to write results to a local CSV file.
When I run this script, the code after the call to toPandas() does not appear to run. I have log statements before this method call and afterwards, however only the log entries before the call show up on the spark-submit console output. I have thought that maybe this is due to the rest of the code being run in a separate executor process by Spark, so the logs don't show on the console. If this is true, how can I see my application logs for the executor? I have enabled the event log with spark.eventLog.enabled=true but this seems to show only internal events, not my actual application log statements.
Even if the assumption about executor logs above is true or false, I don't see the CSV file written to the path that I expect (/tmp). Further, the history server says No completed applications found! when I start it, configuring it to read the event log (spark.history.fs.logDirectory for the history server, spark.eventLog.dir for Spark). It does show an incomplete application, the only complete job listed there is for my toPandas() call.
How am I supposed to figure out what's happening? Spark shows no errors at any point, and I can't seem to view my own application logs.
When you use the toPandas() to convert your spark dataframe to your pandas dataframe, it's actually a heavy action because it will pull all the records to the driver.
Remember that Spark is a distributed computing engine and it's doing the parallel computing. Therefore your data will be distributed to different node and it's completely different to pandas dataframe, since pandas works on single machine but spark work in cluster. You can check this post: why does python dataFrames' are localted only in the same machine?
Back to your post, actually it covers 2 questions:
Why there is no logs after toPandas(): As mentioned above, Spark is a distributed computing engine. The event log will only save the job details which appear in Spark computation DAG. Other non Spark log will not be saved in spark log, if you really those log, you need to use external library like logging to collect the logs in driver.
Why there is no CSV saved in /tmp dir: As you mentioned that when you check the event log, there is a an incomplete application but not a failed application, I believe you dataframe is so huge that you collection has not finished and your transformation in pandas dataframe has even not yet started. You can try to collect few record, let's say df.limit(20).toPandas() to see if it works or not. If it works, that means your dataframe that converts to pandas is so large and it takes time. If it's not work, maybe you can share more about the error traceback.

Does Spark Sql executions use thread local jobgroup?

From my findings running multiple sparksqls with different job groups does not put them in the specified groups.
https://issues.apache.org/jira/browse/SPARK-29340
Creating new threadlocal jobgroup works for spark dataframe jobs but not for sparksql. Is there a way to put all threadlocal spark sql executions in a separate jobgroup?
val sparkThreadLocal: SparkSession = DataCurator.spark.newSession()
sparkThreadLocal.sparkContext.setJobGroup("<id>", "<description>")
OR
sparkThreadLocal.sparkContext.setLocalProperty("spark.job.description", "<id>")
sparkThreadLocal.sparkContext.setLocalProperty("spark.jobGroup.id", "<description>")
Solved! It was an issue with using scala parallel iteration, which uses threadpools.

Can't find a proper way to close Java Spark session(with spark version 2.2.1)

I'm deploy a spark on yarn driver java application,it will submit spark jobs(mainly doing some offline statistics over hive,elasticsearch and hbase) to cluster when Task Scheduling System gives it a task call.So I make this driver app keep running,always wait for request.
I use thread pool to handle task calls,Every task will open a new
SparkSession and close it when job finished(we skip multiple tasks
call at the same time scenario to simplify this question).Java code should be like this:
SparkSession sparkSession=SparkSession.builder().config(new SparkConf().setAppName(appName)).enableHiveSupport().getOrCreate();
......doing statistics......
sparkSession.close();
This app compiled and running under jdk8 and memory related configured as fellow:
spark.ui.enabled=false
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.driver.memory=2G
--driver-java-options "-XX:MaxDirectMemorySize=2048M -XX:+UseG1GC"
At first glance I thought this driver app will consumes at most 4G memories,but as it keeps running,TOP shows it takes more and more resident size.
I dumped it's heap file and saw many Spark related instances left in threadLocal after sparksession closed,such as Hive metastore,SparkSession itself.After many study,I find Spark using a lot of threadlocals and havn't remove them(or I just havn't use the right way to close sparksession) I add those codes to clear threadlocals that spark left behind:
import org.apache.hadoop.hive.ql.metadata.Hive;
import org.apache.hadoop.hive.ql.session.SessionState;
......
SparkSession.clearDefaultSession();
sparkSession.close();
Hive.closeCurrent();
SessionState.detachSession();
SparkSession.clearActiveSession();
This seems work for now,but I think it's just not decent enough,I'm wondering is there a better way to do it like another single spark java api can do all the cleaning?I just can't find a clue from spark document.

How many SparkSessions can a single application have?

I have found that as Spark runs, and tables grow in size (through Joins) that the spark executors will eventually run out of memory and the entire system crashes. Even if I try to write temporary results to Hive tables (on HDFS), the system still doesn't free much memory, and my entire system crashes after about 130 joins.
However, through experimentation, I realized that if I break the problem into smaller pieces, write temporary results to hive tables, and Stop/Start the Spark session (and spark context), then the system's resources are freed. I was able to join over 1,000 columns using this approach.
But I can't find any documentation to understand if this is considered a good practice or not (I know you should not acquire multiple sessions at once). Most systems acquire the session in the beginning and close it in the end. I could also break the application into smaller ones, and use a driver like Oozie to schedule these smaller applications on Yarn. But this approach would start and stop the JVM at each stage, which seems a bit heavy-weight.
So my question: is it bad practice to continually start/stop the spark session to free system resources during the run of a single spark application?
But can you elaborate on what you mean by a single SparkContext on a single JVM? I was able call sparkSession.sparkContext().stop(), and also stop the SparkSession. I then created a new SparkSession and used a new sparkContext. No error was thrown.
I was also able to use this on the JavaSparkPi without any problems.
I have tested this in yarn-client and a local spark install.
What exactly does stopping the spark context do, and why can you not create a new one once you've stopped one?
TL;DR You can have as many SparkSessions as needed.
You can have one and only one SparkContext on a single JVM, but the number of SparkSessions is pretty much unbounded.
But can you elaborate on what you mean by a single SparkContext on a single JVM?
It means that at any given time in the lifecycle of a Spark application the driver can only be one and only one which in turn means that there's one and only one SparkContext on that JVM available.
The driver of a Spark application is where the SparkContext lives (or it's the opposite rather where SparkContext defines the driver -- the distinction is pretty much blurry).
You can only have one SparkContext at one time. Although you can start and stop it on demand as many times you want, but I remember an issue about it that said you should not close SparkContext unless you're done with Spark (which usually happens at the very end of your Spark application).
In other words, have a single SparkContext for the entire lifetime of your Spark application.
There was a similar question What's the difference between SparkSession.sql vs Dataset.sqlContext.sql? about multiple SparkSessions that can shed more light on why you'd want to have two or more sessions.
I was able call sparkSession.sparkContext().stop(), and also stop the SparkSession.
So?! How does this contradict what I said?! You stopped the only SparkContext available on the JVM. Not a big deal. You could, but that's just one part of "you can only have one and only one SparkContext on a single JVM available", isn't it?
SparkSession is a mere wrapper around SparkContext to offer Spark SQL's structured/SQL features on top of Spark Core's RDDs.
From the point of Spark SQL developer, the purpose of a SparkSession is to be a namespace for query entities like tables, views or functions that your queries use (as DataFrames, Datasets or SQL) and Spark properties (that could have different values per SparkSession).
If you'd like to have the same (temporary) table name used for different Datasets, creating two SparkSessions would be what I'd consider the recommended way.
I've just worked on an example to showcase how whole-stage codegen works in Spark SQL and have created the following that simply turns the feature off.
// both where and select operators support whole-stage codegen
// the plan tree (with the operators and expressions) meets the requirements
// That's why the plan has WholeStageCodegenExec inserted
// You can see stars (*) in the output of explain
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)
scala> q.explain
== Physical Plan ==
*Project [_2#89 AS c0#93]
+- *Filter (_1#88 = 0)
+- LocalTableScan [_1#88, _2#89, _3#90]
// Let's break the requirement of having up to spark.sql.codegen.maxFields
// I'm creating a brand new SparkSession with one property changed
val newSpark = spark.newSession()
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_MAX_NUM_FIELDS
newSpark.sessionState.conf.setConf(WHOLESTAGE_MAX_NUM_FIELDS, 2)
scala> println(newSpark.sessionState.conf.wholeStageMaxNumFields)
2
// Let's see what's the initial value is
// Note that I use spark value (not newSpark)
scala> println(spark.sessionState.conf.wholeStageMaxNumFields)
100
import newSpark.implicits._
// the same query as above but created in SparkSession with WHOLESTAGE_MAX_NUM_FIELDS as 2
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)
// Note that there are no stars in the output of explain
// No WholeStageCodegenExec operator in the plan => whole-stage codegen disabled
scala> q.explain
== Physical Plan ==
Project [_2#122 AS c0#126]
+- Filter (_1#121 = 0)
+- LocalTableScan [_1#121, _2#122, _3#123]
I then created a new SparkSession and used a new SparkContext. No error was thrown.
Again, how does this contradict what I said about a single SparkContext being available? I'm curious.
What exactly does stopping the spark context do, and why can you not create a new one once you've stopped one?
You can no longer use it to run Spark jobs (to process large and distributed datasets) which is pretty much exactly the reason why you use Spark in the first place, doesn't it?
Try the following:
Stop SparkContext
Execute any processing using Spark Core's RDD or Spark SQL's Dataset APIs
An exception? Right! Remember that you close the "doors" to Spark so how could you have expected to be inside?! :)

SnappyData - snappy-job - cannot run jar file

I'm trying run jar file from snappydata cli.
I'm just want to create a sparkSession and SnappyData session on beginning.
package io.test
import org.apache.spark.sql.{SnappySession, SparkSession}
object snappyTest {
def main(args: Array[String]) {
val spark: SparkSession = SparkSession
.builder
.appName("SparkApp")
.master("local")
.getOrCreate
val snappy = new SnappySession(spark.sparkContext)
}
}
From sbt file:
name := "SnappyPoc"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies += "io.snappydata" % "snappydata-cluster_2.11" % "1.0.0"
When I'm debuging code in IDE, it works fine, but when I create a jar file and try to run it directly on snappy I get message:
"message": "Ask timed out on [Actor[akka://SnappyLeadJobServer/user/context-supervisor/snappyContext1508488669865777900#1900831413]] after [10000 ms]",
"errorClass": "akka.pattern.AskTimeoutException",
I have Spark Standalone 2.1.1, SnappyData 1.0.0.
I added dependencies to Spark instance.
Could you help me ?. Thank in advanced.
The difference between "embedded" mode and "smart connector" mode needs to be explained first.
Normally when you run a job using spark-submit, then it spawns a set of new executor JVMs as per configuration to run the code. However in the embedded mode of SnappyData, the nodes hosting the data also host long-running Spark Executors themselves. This is done to minimize data movement (i.e. move execution rather than data). For that mode you can submit a job (using snappy-job.sh) which will run the code on those pre-existing executors. Alternative routes include the JDBC/ODBC for embedded execution. This also means that you cannot (yet) use spark-submit to run embedded jobs because that will spawn its own JVMs.
The "smart connector" mode is the normal way in which other Spark connectors work but like all those has the disadvantage of having to pull the required data into the executor JVMs and thus will be slower than embedded mode. For configuring the same, one has to specify "snappydata.connection" property to point to the thrift server running on SnappyData cluster's locator. It is useful for many cases where users want to expand the execution capacity of cluster (e.g. if cluster's embedded execution is saturated all the time on CPU), or for existing Spark distributions/deployments. Needless to say that spark-submit can work in the connector mode just fine. What is "smart" about this mode is: a) if physical nodes hosting the data and running executors are common, then partitions will be routed to those executors as much as possible to minimize network usage, b) will use the optimized SnappyData plans to scan the tables, hash aggregation, hash join.
For this specific question, the answer is: runSnappyJob will receive the SnappySession object as argument which should be used rather than creating it. Rest of the body that uses SnappySession will be exactly same. Likewise for working with base SparkContext, it might be easier to implement SparkJob and code will be similar except that SparkContext will be provided as function argument which should be used. The reason being as explained above: embedded mode already has a running SparkContext which needs to be used for jobs.
I think there were missing methods isValidJob and runSnappyJob.
When I added those to code it works, but know someone what is releation beetwen body of metod runSnappyJob and method main
Should be the same in both ?

Resources