Spark (2.4.5) throws the following error when trying to execute a select query similar to the one shown below.
org.apache.spark.sql.AnalysisException: Undefined function: 'coalesce'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 12
SELECT cast(coalesce(column1,'') as string) as id,cast(coalesce(column2,'2020-01-01') as date) as date1
from 4dea68ed921940e58f027e7146d495a4
Table 4dea68ed921940e58f027e7146d495a4 is a temp view created in Spark from a DataFrame.
The error happens intermittently and only after certain processes. Any help would be much appreciated.
The Spark job is submitted through Livy. The job takes two optional parameters and only one was provided. Providing all of the parameters resolved the issue. I don't know why omitting an optional parameter caused this weird behavior, but supplying it fixed the problem.
Related
I am trying to use the ARRAY_AGG function in Spark SQL. When I use it, it throws this error:
<<Undefined function: 'array_agg'. This function is neither a registered temporary function nor a permanent function registered in the database 'default>>
Dataset<Row> finalDS1 = sparkSession.sql("select array_agg(company_private_id) from TEMP_COMPANY_PRIVATE_VIEW");
Does anyone know how to solve this? I am trying to compare one array with another column; for that I am using ARRAY_AGG.
"select cp.array_column & (select array_agg(int_column) from getCompanyPrivateDS ds1) as filtered_data from getCompanyPrivateDS cp"
I think this is a documentation error by Spark. They clearly show array_agg() in their function list: https://spark.apache.org/docs/latest/api/sql/index.html#array_agg
but I have also found that this function does not work on Spark 3.1.2.
collect_set() and collect_list() should work for your purposes: the former dedupes results, while the latter doesn't.
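For example, here is a minimal PySpark sketch (assuming the view getCompanyPrivateDS and the columns array_column and int_column from the question); collect_list() gathers the ids into an array, and array_intersect() compares it against the existing array column:

# Hedged sketch, not the exact code from the thread: aggregate ids with
# collect_list() and intersect the result with the array column.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

filtered = spark.sql("""
    SELECT array_intersect(
             cp.array_column,
             (SELECT collect_list(int_column) FROM getCompanyPrivateDS)
           ) AS filtered_data
    FROM getCompanyPrivateDS cp
""")
filtered.show()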
While loading data from a Sybase DB in AWS Glue, I encounter an error:
Py4JJavaError: An error occurred while calling o261.load.
: java.sql.SQLException: The identifier that starts with '__SPARK_GEN_JDBC_SUBQUERY_NAME' is too long. Maximum length is 30.
The code I use is:
spark.read.format("jdbc").
option("driver", "net.sourceforge.jtds.jdbc.Driver").
option("url", jdbc_url).
option("query", query).
option("user", db_username).
option("password", db_password).
load()
Is there any way to set this identifier to a custom, shorter value? Interestingly, I am able to load all the data from a particular table by replacing the query option with option("dbtable", table), but invoking a custom query is impossible.
Best Regards
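One possible workaround, assuming the Sybase/jTDS side accepts a parenthesized derived table: pass the custom query through the dbtable option with a short alias of your own, so Spark does not have to generate the long __SPARK_GEN_JDBC_SUBQUERY_NAME identifier. A hedged sketch, reusing the variables from the snippet above:

# Hedged sketch: wrap the query as a derived table with a short alias ("q")
# and pass it via "dbtable" instead of "query".
short_alias_query = "({}) q".format(query)

df = spark.read.format("jdbc") \
    .option("driver", "net.sourceforge.jtds.jdbc.Driver") \
    .option("url", jdbc_url) \
    .option("dbtable", short_alias_query) \
    .option("user", db_username) \
    .option("password", db_password) \
    .load()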
I am reading and writing events from Event Hub in Spark, aggregating based on a few keys like this:
val df1 = df0
  .groupBy(
    colKey,
    colTimestamp
  )
  .agg(
    collect_list(
      struct(
        colCreationTimestamp,
        colRecordId
      )
    ).as("Records")
  )
But I am getting this error at runtime:
Caused by: org.apache.spark.sql.execution.streaming.state.StateSchemaNotCompatible: Provided schema doesn't match to the schema for existing state! Please note that Spark allow difference of field name: check count of fields and data type of each field.
- Provided key schema: StructType(StructField(Key,StringType,true), StructField(Timestamp,TimestampType,true)
- Provided value schema: StructType(StructField(buf,BinaryType,true))
- Existing key schema: StructType(StructField(_1,StringType,true), StructField(_2,TimestampType,true))
- Existing value schema: StructType(StructField(buf,BinaryType,true))
If you want to force running query without schema validation, please set spark.sql.streaming.stateStore.stateSchemaCheck to false.
Please note running query with incompatible schema could cause indeterministic behavior.
at org.apache.spark.sql.execution.streaming.state.StateSchemaCompatibilityChecker.check(StateSchemaCompatibilityChecker.scala:60)
at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$getStateStoreProvider$2(StateStore.scala:487)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$getStateStoreProvider$1(StateStore.scala:487)
at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
The exception doesn't contain an exact line number to reference in my code, so I narrowed it down to this code based on the provided key schema columns; if I change the groupBy key columns, the error changes accordingly.
I tried different things, like an explicit df0.select() of the required columns before the groupBy to make sure the incoming data had them, but got the same error.
Can someone suggest how it picks up the existing key schema, or what I should look at to resolve this?
Update [solved for me]
While writing the records to Event Hub, the Event Hubs Spark library stores state in the checkpoint directory, which still held old state and caused the StateSchemaNotCompatible issue. Pointing to a new checkpoint directory solved the issue for me.
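For reference, a minimal PySpark sketch of that fix (the question's code is Scala; the sink format, ehWriteConf, and the checkpoint path below are placeholder assumptions): start the aggregated stream against a fresh checkpointLocation so the state store is rebuilt with the new key schema.

# Hedged sketch: point the streaming write at a new checkpoint directory.
# "eventhubs" is the Azure Event Hubs connector format; ehWriteConf and the
# checkpoint path are hypothetical placeholders.
query = df1.writeStream \
    .outputMode("update") \
    .format("eventhubs") \
    .options(**ehWriteConf) \
    .option("checkpointLocation", "/checkpoints/eventhub-agg-v2") \
    .start()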
I am making a JDBC connection to a Denodo database using PySpark. The table I am connecting to contains the "TIMESTAMP_WITH_TIMEZONE" datatype for two columns. Since Spark provides built-in JDBC dialects for only a handful of databases, of which Denodo is not one, it cannot recognize the "TIMESTAMP_WITH_TIMEZONE" datatype and hence cannot map it to any of its Spark SQL datatypes.
To overcome this I am providing a custom schema (c_schema here), but this is not working either and I am getting the same error. Below is the code snippet.
c_schema="game start date TIMESTAMP,game end date TIMESTAMP"
df = spark.read.jdbc("jdbc_url", "schema.table_name",properties={"user": "user_name", "password": "password","customSchema":c_schema,"driver": "com.denodo.vdp.jdbc.Driver"})
Please let me know how I should fix this.
For anyone else facing this issue while connecting to Denodo using Spark: use the CAST function to convert the "TIMESTAMP_WITH_TIMEZONE" datatype into another datatype such as String, Date, or Timestamp. I posted this question on the Denodo community page too and have attached its official response.
CAST("PLANNED START DATE" as DATE) as "PLANNED_START_DATE"
If I run the following code in Spark (2.3.2.0-mapr-1901), it runs fine on the first run.
SELECT count( `cpu-usage` ) as `cpu-usage-count` , sum( `cpu-usage` ) as `cpu-usage-sum` , percentile_approx( `cpu-usage`, 0.95 ) as `cpu-usage-approxPercentile`
FROM filtered_set
Where filtered_set is a DataFrame that has been registered as a temp view using createOrReplaceTempView.
I get a result and all is good on the first call. But...
If I then run this job again (note that this is a shared Spark context, managed via Apache Livy), Spark throws:
Wrapped by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.spark.sql.AnalysisException: Undefined function: 'count'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 10
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1216)
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1216)
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1215)
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1213)
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
...
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
This problem occurs on the second run of the Livy job (which uses the previous Spark session). It is not isolated to the count function (it also happens with sum, etc.), and any function appears to fail on the second run, regardless of what was called in the first run.
It seems like Spark's function registry is being cleared out (including the default built-in functions). We are not doing anything with the Spark context.
Questions:
- Is this expected or normal behaviour with Spark?
- How would I reset or initialise the Spark session so it doesn't lose all these functions?
I have seen Undefined function errors described elsewhere in terms of user-defined functions, but never for the built-ins.