Problem saving a Spark timestamp into Azure Synapse datetime2(7)

I have a table in an Azure Synapse database with a single column of datatype datetime2(7).
In Azure Databricks I have a table with the following schema:
df.schema
StructType(List(StructField(dates_tst,TimestampType,true)))
When I try to save to Synapse, I get this error message:
Py4JJavaError: An error occurred while calling o535.save.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 15.0 failed 4 times, most recent failure: Lost task 3.3 in stage 15.0 (TID 46) (10.139.64.5 executor 0): com.microsoft.sqlserver.jdbc.SQLServerException: 110802;An internal DMS error occurred that caused this operation to fail
SqlNativeBufferBufferBulkCopy.WriteTdsDataToServer, error in OdbcDone: SqlState: 42000, NativeError: 4816, 'Error calling: bcp_done(this->GetHdbc()) | SQL Error Info: SrvrMsgState: 1, SrvrSeverity: 16, Error <1>: ErrorMsg: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid column type from bcp client for colid 1. | Error calling: pConn->Done() | state: FFFF, number: 75205, active connections: 35', Connection String: Driver={pdwodbc17e};app=TypeD00-DmsNativeWriter:DB2\mpdwsvc (56768)-ODBC;autotranslate=no;trusted_connection=yes;server=\\.\pipe\DB.2-e2f5d1c1f0ba-0\sql\query;database=Distribution_24
EDIT: Runtime version 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
EDIT 2:
It has been solved. The errors were:
I was using an incorrect format in the write options: "com.microsoft.sqlserver.jdbc.spark", which I changed to "com.databricks.spark.sqldw" (see the sketch below).
There were also errors in the secret scope credentials.
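For reference, a minimal sketch of the corrected write with the "com.databricks.spark.sqldw" connector; the secret scope, key, JDBC URL, staging directory, and table name below are placeholders, not values from the post:

# Placeholders throughout: secret scope/key, JDBC URL, staging dir, and target table.
jdbc_url = dbutils.secrets.get(scope="my-scope", key="synapse-jdbc-url")

(df.write
    .format("com.databricks.spark.sqldw")
    .option("url", jdbc_url)
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.dates_table")
    .option("tempDir", "abfss://container@account.dfs.core.windows.net/tempdir")
    .mode("append")
    .save())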

org.apache.spark.SparkException: Job aborted due to stage failure
Generally, the above error occurs when you perform operations on a column that contains null values.
Replace the null values with valid datetime values (a sketch follows below).
Also check your Spark version.
Refer to this SO answer by Lyuben Todorov.
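A minimal sketch of that replacement, assuming the column is named dates_tst as in the schema above and that a fixed placeholder timestamp is acceptable:

from pyspark.sql import functions as F

# Replace null timestamps with a fixed placeholder value before writing to Synapse.
df_clean = df.withColumn(
    "dates_tst",
    F.coalesce(F.col("dates_tst"), F.lit("1900-01-01 00:00:00").cast("timestamp"))
)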

Related

Hoodie (Hudi) precombine field failing on NULL

My AWS Glue job for Hudi CDC is failing on a column that is a precombine field (see error message below). I have validated that there are no NULL values in this column (it has an AFTER UPDATE trigger and a default of NOW() set). When I query the parquet files using Spark, the only records that show NULL are records marked with an operation ('op') of DELETE. From my understanding, Hudi only transmits the PRIMARY KEY on a DELETE operation and nothing else.
Why is Hudi failing on a precombine with a NULL value in the DELETE operation? How can I fix this? Am I missing an option or something? Any help is greatly appreciated.
Error message:
2022-06-06 19:05:13,633 ERROR [Executor task launch worker for task
2.0 in stage 46.0 (TID 264)] executor.Executor (Logging.scala:logError(94)): Exception in task 2.0 in stage 46.0 (TID
264) org.apache.hudi.exception.HoodieException: The value of
last_modified_date can not be null
Hudi options:
options = {
    "hoodie_overrides": {
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "last_modified_date",
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.NonPartitionedExtractor",
        "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
        "hoodie.datasource.hive_sync.support_timestamp": "true",
    }
}
Spark query of parquet files:
You can try setting this configuration to "false":
hoodie.combine.before.delete
Source: hoodie.combine.before.delete
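For reference, a minimal sketch of adding that setting to the options from the question, assuming the Glue job merges hoodie_overrides verbatim into the Hudi write options:

# Assumption: these overrides end up unchanged in the final Hudi write options.
options["hoodie_overrides"]["hoodie.combine.before.delete"] = "false"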

Issue with Spark SQL to Pandas

I'm getting an error from the following code:
txn = spark.sql('select * from temporary view').toPandas()
I got Py4JJavaError: An error occurred while calling o420.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure.
Please help!

Getting an error while writing to Elastic search from spark with custom mapping id

I'm trying to write a dataframe from Spark to Elasticsearch with a custom mapping id, and when I do that I get the error below.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 16 times, most recent failure: Lost task 0.15 in stage 14.0 (TID 860, ip-10-122-28-111.ec2.internal, executor 1): org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: [DataFrameFieldExtractor for field [[paraId]]] cannot extract value from entity [class java.lang.String] | instance
Below is the configuration used for writing to ES.
var config = Map(
  "es.nodes" -> node,
  "es.port" -> port,
  "es.clustername" -> clustername,
  "es.net.http.auth.user" -> login,
  "es.net.http.auth.pass" -> password,
  "es.write.operation" -> "upsert",
  "es.mapping.id" -> "paraId",
  "es.resource" -> "test/type")

df.saveToEs(config)
I'm using ES version 5.6 and Spark 2.2.0. Let me know if you have any insight on this.
Thanks!

count throws java.lang.NumberFormatException: null on the file loaded from object store with inferSchema enabled

The count() on a dataframe loaded from IBM Bluemix object storage throws the following exception when inferSchema is enabled:
Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 3 in stage 43.0 failed 10 times, most recent failure: Lost task 3.9 in stage 43.0 (TID 166, yp-spark-dal09-env5-0034): java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(Integer.java:554)
at java.lang.Integer.parseInt(Integer.java:627)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:241)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
I don't get the above exception if I disable inferSchema.
Why am I getting this exception? By default, how many rows does Databricks read when inferSchema is enabled?
This was actually an issue with the spark-csv package (null value still not correctly parsed #192) that was carried into Spark 2.0. It has been corrected in Spark 2.1.
Here is the associated PR: [SPARK-18269][SQL] CSV datasource should read null properly when schema is lager than parsed tokens.
Since you are already using Spark 2.0, you can easily upgrade to 2.1 and drop the spark-csv package. It's not needed anyway (a sketch follows below).
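A minimal sketch, assuming Spark 2.1+, of reading the file with the built-in CSV datasource instead of spark-csv; the object-store path and the nullValue setting are placeholders/assumptions about the data:

# spark is an existing SparkSession; the path below is a placeholder.
df = (spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("nullValue", "")   # assumption: empty strings should be read as nulls
        .csv("swift://mycontainer.keystone/myfile.csv"))
df.count()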

"Missing EOF" Error message when querying using the spark-cassandra-connector

I want to query a Cassandra table with the spark-cassandra-connector using the following statements:
sc.cassandraTable("citizens","records")
.select("identifier","name")
.where( "name='Alice' or name='Bob' ")
And I get this error message:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 81.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 81.0 (TID 9199, mydomain):
java.io.IOException: Exception during preparation of
SELECT "identifier", "name" FROM "citizens"."records" WHERE token("id") > ? AND token("id") <= ? AND name='Alice' or name='Bob' LIMIT 10 ALLOW FILTERING:
line 1:127 missing EOF at 'or' (...<= ? AND name='Alice' [or] name...)
What am I doing wrong here and how can I make an or query using the where clause of the connector?
Your OR clause is not valid CQL. For a small number of key values like this (I'm assuming name is a key), you can use an IN clause:
.where( "name in ('Alice', 'Bob') ")
The where clause is used for pushing CQL down to Cassandra, so only valid CQL can go inside it. If you are looking for Spark-side SQL-like syntax, check out Spark SQL and Datasets (a sketch follows below).
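A minimal sketch of that Spark-side route, shown here with the connector's DataFrame source from PySpark (the question uses the Scala RDD API); keyspace and table names are taken from the question:

# Load the Cassandra table through the connector's DataFrame datasource.
df = (spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="citizens", table="records")
        .load())

# The OR filter runs on the Spark side, so it does not need to be valid CQL.
(df.select("identifier", "name")
   .filter("name = 'Alice' OR name = 'Bob'")
   .show())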
