Spark throws AnalysisException: Undefined function: 'count' for a Spark built-in function - apache-spark

If I run the following query in Spark (2.3.2.0-mapr-1901), it runs fine on the first run.
SELECT count( `cpu-usage` ) as `cpu-usage-count` , sum( `cpu-usage` ) as `cpu-usage-sum` , percentile_approx( `cpu-usage`, 0.95 ) as `cpu-usage-approxPercentile`
FROM filtered_set
Here, filtered_set is a DataFrame that has been registered as a temp view using createOrReplaceTempView.
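Roughly, the call looks like this (a minimal sketch; the DataFrame and view names are from above, and the session handle spark is assumed):
filtered_set.createOrReplaceTempView("filtered_set")
result = spark.sql("""
    SELECT count(`cpu-usage`)                   AS `cpu-usage-count`,
           sum(`cpu-usage`)                     AS `cpu-usage-sum`,
           percentile_approx(`cpu-usage`, 0.95) AS `cpu-usage-approxPercentile`
    FROM filtered_set
""")
result.show()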
I get a result and all is good on the first call. But...
If I then run this job again (note that this is a shared Spark context, managed via Apache Livy), Spark throws:
Wrapped by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.spark.sql.AnalysisException: Undefined function: 'count'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 10
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1216)
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1216)
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1215)
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1213)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
...
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
This problem occurs on the second run of the Livy job (which reuses the previous Spark session). It is not isolated to the count function (the same happens with sum, etc.); any function appears to fail on the second run, regardless of which functions were called in the first run.
It seems like Spark's function registry is being cleared out (including the default built-in functions). We're not doing anything with the Spark context.
Questions:
- Is this expected or normal behaviour with Spark?
- How would I reset or reinitialise the Spark session so that it doesn't lose all of these functions?
I have seen Undefined function errors described elsewhere in terms of user-defined functions, but never for the built-ins.
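A diagnostic sketch that might help narrow this down (assuming spark is the shared session handle used by the Livy jobs):
# Sketch: check whether built-ins such as count/percentile_approx are still in
# the function registry right before the second run.
registered = {f.name for f in spark.catalog.listFunctions()}
print("count registered:", "count" in registered)
print("percentile_approx registered:", "percentile_approx" in registered)

# Possible isolation strategy: give each Livy job its own child session. It
# shares the SparkContext and cached data, but keeps separate temporary views,
# SQL configuration and function registrations, so state left behind by a
# previous job cannot leak into the next one.
job_session = spark.newSession()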

Related

ARRAY_AGG function does not work in Spark SQL

I am trying to use the ARRAY_AGG function in Spark SQL. When I use it, it throws an error:
<<Undefined function: 'array_agg'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'>>
Dataset<Row> finalDS1 = sparkSession.sql("select array_agg(company_private_id) from TEMP_COMPANY_PRIVATE_VIEW");
Does anyone know how to solve it? I am trying to compare one array with another column, and for that I am using ARRAY_AGG:
"select cp.array_column & (select array_agg(int_column) from getCompanyPrivateDS ds1) as filtered_data from getCompanyPrivateDS cp"
I think this is a documentation error on Spark's part. They clearly list array_agg() in their SQL function reference: https://spark.apache.org/docs/latest/api/sql/index.html#array_agg, but I have also found that this function does not work on Spark 3.1.2.
collect_set() and collect_list() should work for your purposes: the former dedupes results, while the latter doesn't.
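For example, a PySpark sketch using the view and column names from the question (the question's code is Java, but the SQL is the same):
# collect_list keeps duplicates; collect_set removes them.
ids_df = spark.sql(
    "SELECT collect_list(company_private_id) AS company_private_ids "
    "FROM TEMP_COMPANY_PRIVATE_VIEW"
)
ids_df.show(truncate=False)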

AWS Glue reading data from Sybase table

While loading data from a Sybase DB in AWS Glue, I encounter an error:
Py4JJavaError: An error occurred while calling o261.load.
: java.sql.SQLException: The identifier that starts with '__SPARK_GEN_JDBC_SUBQUERY_NAME' is too long. Maximum length is 30.
The code I use is:
(spark.read.format("jdbc")
    .option("driver", "net.sourceforge.jtds.jdbc.Driver")
    .option("url", jdbc_url)
    .option("query", query)
    .option("user", db_username)
    .option("password", db_password)
    .load())
Is there any way to set this identifier to a custom, shorter one? What's interesting is that I am able to load all the data from a particular table by replacing the query option with option("dbtable", table), but invoking a custom query is impossible.
Best Regards
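One workaround that may be worth trying (a sketch, not verified against Sybase/jTDS): pass the query through dbtable as a derived table with a short explicit alias, so Spark does not need to generate its long __SPARK_GEN_JDBC_SUBQUERY_NAME identifier.
# Workaround sketch (unverified against Sybase): wrap the query as a derived
# table with a short alias via "dbtable", so no subquery name is auto-generated.
df = (spark.read.format("jdbc")
    .option("driver", "net.sourceforge.jtds.jdbc.Driver")
    .option("url", jdbc_url)
    .option("dbtable", "({q}) q".format(q=query))
    .option("user", db_username)
    .option("password", db_password)
    .load())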

org.apache.spark.sql.AnalysisException: Undefined function: 'coalesce'

Spark (2.4.5) is throwing the following error when trying to execute a select query similar to the one shown below.
org.apache.spark.sql.AnalysisException: Undefined function: 'coalesce'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 12
SELECT cast(coalesce(column1,'') as string) as id,cast(coalesce(column2,'2020-01-01') as date) as date1
from 4dea68ed921940e58f027e7146d495a4
Table 4dea68ed921940e58f027e7146d495a4 is a temp view created in Spark from a DataFrame.
This error is happening intermittently only after certain processes. Any help would be much appreciated.
The Spark job is submitted through Livy. The job takes two optional parameters and only one was provided. Providing all the parameters resolved the issue. I don't know why omitting an optional parameter caused this odd behaviour, but supplying it fixed it.

What is the purpose of global temporary views?

I'm trying to understand how to use Spark's global temporary views.
In one spark-shell session I've created a view:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('spark_sql').getOrCreate()
df = (
    spark.read.option("header", "true")
    .option("delimiter", ",")
    .option("inferSchema", "true")
    .csv("/user/root/data/cars.csv"))
df.createGlobalTempView("my_cars")
# works without any problem
spark.sql("SELECT * FROM global_temp.my_cars").show()
And in another spark-shell session I tried to access it, without success (table or view not found).
# second spark-shell session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('spark_sql').getOrCreate()
spark.sql("SELECT * FROM global_temp.my_cars").show()
That's the error I receive:
pyspark.sql.utils.AnalysisException: u"Table or view not found: `global_temp`.`my_cars`; line 1 pos 14;\n'Project [*]\n+- 'UnresolvedRelation `global_temp`.`my_cars`\n"
I've read that each spark-shell has its own context, and that's why one spark-shell cannot see the other's views. So I don't understand the purpose of global temporary views. Where would they be useful?
Thanks
In the Spark documentation you can see:
If you want to have a temporary view that is shared among all sessions
and keep alive until the Spark application terminates, you can create
a global temporary view.
The global table remains accessible as long as the application is alive.
The global table remains accessible only within that application: opening a new shell and giving it the same application name will just create a new application.
You can try and test it within the same shell:
spark.newSession().sql("SELECT * FROM global_temp.my_cars").show()
Please see my answer on a similar question for a more detailed example, as well as a short definition of a Spark application and a Spark session.
Temporary views in Spark SQL are session-scoped and will disappear if the session that creates them terminates. If you want a temporary view that is shared among all sessions and kept alive until the Spark application terminates, you can create a global temporary view. A global temporary view is tied to a system-preserved database global_temp, and we must use the qualified name to refer to it:
df.createGlobalTempView("people")
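For instance, a quick sketch (still within the same application) of reading that view from a second session:
# A second session of the same application can still query the view, because
# global temp views live in the system-preserved global_temp database.
another_session = spark.newSession()
another_session.sql("SELECT * FROM global_temp.people").show()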

Cassandra Hector: How to verify the success/failure of a row update (error handling)

I'm using Hector 1.0-1 to interact with a Cassandra database from a Java application.
The example below shows how to insert (or update) a field:
mutator.addInsertion("650222", "Npanxx", HFactory.createStringColumn("state", "CA"));
MutationResult mr = mutator.execute();
However, there is not much information on the outcome of the operation. How can we verify if the operation was successful or not? The return value is a ResultStatus implementation and the 3 methods that can be called are:
mr.getHostUsed()
mr.getExecutionTimeNano()
mr.getExecutionTimeMicro()
Can I assume that if there were no exceptions calling the execute() method, that the operation succeeded?
It looks like the execute() method doesn't declare any checked exceptions because it throws instances of HectorException, which is a RuntimeException.
So yes, if no exceptions are thrown, the insert succeeded. Otherwise you will get an instance of HectorException thrown (likely HTimedOutException/HUnavailableException for problems on the Cassandra side and something else for something on the Hector side).
