spark GroupBy throws StateSchemaNotCompatible exception with different "Existing key schema" - apache-spark

I am reading and writing events from Event Hubs in Spark and aggregating them on a few keys like this:
val df1 = df0
  .groupBy(
    colKey,
    colTimestamp
  )
  .agg(
    collect_list(
      struct(
        colCreationTimestamp,
        colRecordId
      )
    ).as("Records")
  )
But I am getting this error at runtime:
Error
Caused by: org.apache.spark.sql.execution.streaming.state.StateSchemaNotCompatible: Provided schema doesn't match to the schema for existing state! Please note that Spark allow difference of field name: check count of fields and data type of each field.
- Provided key schema: StructType(StructField(Key,StringType,true), StructField(Timestamp,TimestampType,true)
- Provided value schema: StructType(StructField(buf,BinaryType,true))
- Existing key schema: StructType(StructField(_1,StringType,true), StructField(_2,TimestampType,true))
- Existing value schema: StructType(StructField(buf,BinaryType,true))
If you want to force running query without schema validation, please set spark.sql.streaming.stateStore.stateSchemaCheck to false.
Please note running query with incompatible schema could cause indeterministic behavior.
at org.apache.spark.sql.execution.streaming.state.StateSchemaCompatibilityChecker.check(StateSchemaCompatibilityChecker.scala:60)
at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$getStateStoreProvider$2(StateStore.scala:487)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$getStateStoreProvider$1(StateStore.scala:487)
at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
The exception doesn't contain an exact line number to point me at my code, so I narrowed it down to this snippet based on the provided key schema columns; if I change the groupBy key columns, the error changes accordingly.
I tried different things, like an explicit df0.select() of the required columns before the groupBy to ensure that the incoming data had those columns, but got the same error.
Can someone suggest how Spark is picking up the "Existing key schema", or what I should look at to resolve this?
Update [Solved for me]
While writing the records to Event Hubs, the EventHubs Spark library stores its streaming state in the checkpoint directory. That directory still contained old state, which caused the StateSchemaNotCompatible error; pointing the query at a new checkpoint directory solved the issue for me.
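For reference, a minimal PySpark sketch of the fix described above. The checkpoint path and sink are placeholders (the original job is Scala, but the checkpointLocation option is identical); the only point is that the option must name a directory holding no state from a previous run, since the old directory's state still carried the _1/_2 key schema that the checker compared against.
# Hypothetical fresh checkpoint location; nothing below is specific to Event Hubs.
new_checkpoint = "/mnt/checkpoints/eventhub-agg-v2"

(df1.writeStream
    .outputMode("update")                          # streaming aggregations need update/complete mode
    .option("checkpointLocation", new_checkpoint)  # new directory => empty state store
    .format("console")                             # placeholder sink
    .start())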

Related

AWS Glue reading data from Sybase table

While loading data from a Sybase DB in AWS Glue, I encounter an error:
Py4JJavaError: An error occurred while calling o261.load.
: java.sql.SQLException: The identifier that starts with '__SPARK_GEN_JDBC_SUBQUERY_NAME' is too long. Maximum length is 30.
The code I use is:
spark.read.format("jdbc").
option("driver", "net.sourceforge.jtds.jdbc.Driver").
option("url", jdbc_url).
option("query", query).
option("user", db_username).
option("password", db_password).
load()
Is there any way to set this identifier to a custom, shorter one? Interestingly, I am able to load all the data from a particular table by replacing the query option with option("dbtable", table), but invoking a custom query is impossible.
Best Regards
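For what it's worth, a hedged workaround sketch (shown in PySpark, mirroring the reader above): since dbtable works, wrap the custom query in a subquery with your own short alias instead of letting Spark generate the long __SPARK_GEN_JDBC_SUBQUERY_NAME alias. jdbc_url, query, db_username and db_password are the variables from the question.
df = (spark.read.format("jdbc")
      .option("driver", "net.sourceforge.jtds.jdbc.Driver")
      .option("url", jdbc_url)
      .option("dbtable", f"({query}) q")  # the short alias "q" stays well under Sybase's 30-character limit
      .option("user", db_username)
      .option("password", db_password)
      .load())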

Getting SyntaxException programmatically creating a table with the Cassandra Python driver

Error:
cassandra.protocol.SyntaxException: \
<Error from server: code=2000 [Syntax error in CQL query] \
message="line 1:36 no viable alternative at input '(' \
(CREATE TABLE master_table(dict_keys[(]...)">
Code:
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect('firstkey')
ColName = {"qty_dot_url": "int",
           "qty_hyphen_url": "int",
           "qty_underline_url": "int",
           "qty_slash_url": "int"}
columns = ColName.keys()
values = ColName.values()
session.execute('CREATE TABLE master_table({ColName} {dataType}),PRIMARY KEY(qty_dot_url)'.format(ColName=columns, dataType=values))
How can I resolve the above error?
So I replaced the session.execute with a print, and it produced this:
CREATE TABLE master_table(dict_keys(['qty_dot_url', 'qty_hyphen_url', 'qty_underline_url', 'qty_slash_url']) dict_values(['int', 'int', 'int', 'int'])),PRIMARY KEY(qty_dot_url)
That is not valid CQL. It needs to look like this:
CREATE TABLE master_table(qty_dot_url int, qty_hyphen_url int,
qty_underline_url int, qty_slash_url int, PRIMARY KEY(qty_dot_url))
I was able to create that by making these adjustments to your code:
createTableCQL = "CREATE TABLE master_table("
for key, value in ColName.items():
    createTableCQL += key + " " + value + ", "
createTableCQL += "PRIMARY KEY(qty_dot_url))"
You could then follow that with a session.execute(createTableCQL).
Notes:
The PRIMARY KEY definition must be inside the paren list.
Creating schema from inside application code is often problematic, and can create a schema disagreement in the cluster. It's almost always better to create tables outside of code.
The syntax exception is a result of your Python code generating invalid CQL, as Aaron pointed out in his response.
To add to his answer, you need additional steps whenever you are programmatically making schema changes. In particular, you need to make sure that you check for schema agreement (i.e. that the schema change has been propagated to all nodes) before moving on to the next bit of your code.
You will need to modify your code to save the result from the schema change, for example:
resultset = session.execute(SimpleStatement("CREATE TABLE ..."))
then call this in your code:
resultset.response_future.is_schema_agreed
You'll need to loop through this check until True is returned. Depending on how long you want to wait (the default max_schema_agreement_wait is 10 seconds), you'll need to implement some logic to do [something] when schema agreement is not achieved (because a node is down, for example); this requires manual intervention from an operator to investigate the cluster.
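A minimal sketch of that check, assuming the createTableCQL string built in Aaron's answer; the error handling is only illustrative:
from cassandra.query import SimpleStatement

resultset = session.execute(SimpleStatement(createTableCQL))

# is_schema_agreed reports whether the change propagated to all nodes within
# max_schema_agreement_wait (10 seconds by default).
if not resultset.response_future.is_schema_agreed:
    # A node may be down; as noted above, an operator usually has to investigate
    # the cluster before the application keeps making schema changes.
    raise RuntimeError("Schema agreement not reached after CREATE TABLE")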
As Aaron already said, performing schema changes programmatically is very problematic, and we discourage doing this unless you fully understand the pitfalls and know how to handle failures. Cheers!

Databricks Autoloader throws IllegalArgumentException

I'm trying the simplest Auto Loader example included on the Databricks website:
https://databricks.com/notebooks/Databricks-Data-Integration-Demo.html
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load(input_data_path))

(df.writeStream.format("delta")
   .option("checkpointLocation", chkpt_path)
   .table("iot_stream"))
I keep getting this message:
IllegalArgumentException: cloudFiles.schemaLocation Could not find required option: schemaLocation. Please provide a schema location using cloudFiles.schemaLocation for storing inferred schema and supporting schema evolution.
If providing cloudFiles.schemaLocation is required, why are the examples everywhere missing it? What's the underlying issue here?
I suspect what is going on is that you are not explicitly setting .option("cloudFiles.schemaEvolutionMode"), which means it falls back to the default, "addNewColumns", as per https://docs.databricks.com/ingestion/auto-loader/options.html.
That mode requires you to set .option("cloudFiles.schemaLocation", path) in the reader, so you are inadvertently requiring the option without setting it.
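Concretely, here is the reader from the question with the missing option added; chkpt_path is reused only as a convenient durable location, and any cloud storage path works:
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", chkpt_path)  # where Auto Loader stores the inferred schema
      .load(input_data_path))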

org.apache.spark.sql.AnalysisException: Undefined function: 'coalesce'

Spark (2.4.5) is throwing the following error when trying to execute a select query similar to the one shown below.
org.apache.spark.sql.AnalysisException: Undefined function: 'coalesce'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 12
SELECT cast(coalesce(column1,'') as string) as id,cast(coalesce(column2,'2020-01-01') as date) as date1
from 4dea68ed921940e58f027e7146d495a4
Table 4dea68ed921940e58f027e7146d495a4 is a temp view created in Spark from a DataFrame.
This error is happening intermittently only after certain processes. Any help would be much appreciated.
The Spark job is submitted through Livy. The job takes two optional parameters, and only one was provided. Providing all the parameters resolved the issue. I don't know why omitting an optional parameter caused this weird behavior, but supplying it resolved the issue.

Spark DataFrame Filter using Binary (Array[Bytes]) data

I have a DataFrame from a JDBC table hitting MySQL and I need to filter it using a UUID. The data is stored in MySQL as binary(16) and, when queried out in Spark, is converted to Array[Byte] as expected.
I'm new to Spark and have been trying various ways to pass a variable of type UUID into the DataFrame's filter method.
I've tried statements like
val id: UUID = // other logic that looks this up
df.filter(s"id = $id")
df.filter("id = " convertToByteArray(id))
df.filter("id = " convertToHexString(id))
All of these error with different messages.
I just need to somehow pass in Binary types but can't seem to put my finger on how to do so properly.
Any help is greatly appreciated.
After reviewing even more sources online, I found a way to accomplish this without using the filter method.
When I'm reading from my sparkSession, I just use an ad hoc table instead of the table name, as follows:
sparkSession.read.jdbc(connectionString, s"(SELECT id, {other col omitted) FROM MyTable WHERE id = 0x$id) AS MyTable", props)
This pre-filters the results for me, and then I just work with the DataFrame as I need.
If anyone knows of a solution using filter, I'd still love to know it, as that would be useful in some cases.
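In case it helps, a hedged PySpark sketch of the filter-based approach (the question's code is Scala, but the idea carries over): build a binary literal from the UUID's bytes and compare the column against it. Whether MySQL holds the raw UUID bytes or a reordered form depends on how the data was written, so the conversion below is an assumption.
import uuid
from pyspark.sql import functions as F

some_id = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")  # hypothetical value

# bytearray maps to Spark's BinaryType, so lit() produces a binary literal that
# can be compared against the binary(16)-backed column.
filtered = df.filter(F.col("id") == F.lit(bytearray(some_id.bytes)))
Note that a binary-literal filter may not be pushed down to MySQL, so the subquery approach above can still be the faster option on large tables.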
