Error when writing Spark dataframe from Databricks into Azure Synapse - apache-spark

I am trying to write a Spark dataframe into an Azure Synapse database.
My code:
try:
    re_spdf.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("append") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .option("encrypt", "true") \
        .option("trustServerCertificate", "false") \
        .option("hostNameInCertificate", "*.database.windows.net") \
        .option("mssqlIsolationLevel", "READ_UNCOMMITTED") \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .save()
except ValueError as error:
    print("Connector write failed", error)
Error message:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 29.0 failed 4 times, most recent failure:
Lost task 1.3 in stage 29.0 (TID 885, 10.139.64.8, executor 0):
com.microsoft.sqlserver.jdbc.SQLServerException:
PdwManagedToNativeInteropException ErrorNumber: 46724, MajorCode: 467,
MinorCode: 24, Severity: 20, State: 2, Exception of type
'Microsoft.SqlServer.DataWarehouse.Tds.PdwManagedToNativeInteropException' was thrown.
I googled this error message, but I didn't find any useful solution.
Update: My working environment is a Databricks PySpark notebook.
Any suggestions would be appreciated.

There is a column length limitation in the Synapse DB table: it will allow only 4000 characters.
So when I use com.databricks.spark.sqldw, which uses PolyBase as the connector, I need to change the length of the column in the DB table as well.
Reference: https://forums.databricks.com/questions/21032/databricks-throwing-error-sql-dw-failed-to-execute.html
code:
df.write \
    .format("com.databricks.spark.sqldw") \
    .mode("append") \
    .option("url", url) \
    .option("user", username) \
    .option("password", password) \
    .option("maxStrLength", "4000") \
    .option("tempDir", "tempdirdetails") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("mssqlIsolationLevel", "READ_UNCOMMITTED") \
    .option("dbTable", table_name) \
    .save()
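Since the existing column in the Synapse table also had to be widened, here is a minimal sketch of how that could be done from the notebook with pyodbc; the server, table, and column names are placeholders rather than anything from the original post, and NVARCHAR(4000) simply matches the maxStrLength value above.

# Hypothetical sketch: widen the target string column before writing with PolyBase.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server_name>.sql.azuresynapse.net;"
    "DATABASE=<database_name>;"
    "UID=<user_name>;PWD=<password>"
)
cursor = conn.cursor()
cursor.execute("ALTER TABLE dbo.<table_name> ALTER COLUMN <text_column> NVARCHAR(4000) NULL;")
conn.commit()
conn.close()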

The Azure Databricks documentation says to use the com.databricks.spark.sqldw format to read data from and write data to an Azure Synapse table.

If you are using Synapse, why not use Synapse notebooks? Then writing the dataframe is as easy as calling synapsesql, e.g.
%%spark
df.write.synapsesql("yourPool.dbo.someXMLTable_processed", Constants.INTERNAL)
You would save yourself some trouble and performance should be good as it's parallelised. This is the main article:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export

Related

Spark Action after dataframe load from jdbc driver taking too long

I am loading 10 GB or more of data from two databases, namely Redshift and Snowflake, into dataframes. The data is loaded for comparison between the datasets. The code for loading the data is the same except for the JARs used, whether run without partitioning or with partitioning.
This is the version below with a single partition:
df1 = sparkcontext.read \
    .format("jdbc") \
    .option("driver", "com.amazon.redshift.jdbc42.Driver") \
    .option("url", jdbcurl) \
    .option("query", query) \
    .option("user", user) \
    .option("password", password).load()
df1.count()
df2 = sparkcontext.read \
    .format("jdbc") \
    .option("url", jdbcurl) \
    .option("driver", "net.snowflake.client.jdbc.SnowflakeDriver") \
    .option("schema", SchemaName) \
    .option("query", query) \
    .option("user", user) \
    .option("password", password).load()
df2.count()
The count on the second one takes 5 seconds, while the count on the first one takes 60 seconds.
Since at this moment we are not performing any calculation except counting the number of rows in each dataframe, why is there a difference of ~50-55 seconds between the two?
df1.join(df2, on=<clause>, how="outer").where(<clause>)
If I bypass this operation, df1 and df2 are then joined, with partitioning done at the time of loading by passing the options below; the Snowflake stage completes in 4 minutes, but the Redshift one takes 10 minutes. Why is there a difference of 6 minutes between the two stages? The partitioning at the time of loading is done like this:
.option("partitionColumn", partitioncolumn) \
.option("lowerBound", lowerbound) \
.option("upperBound", upperbound) \
.option("numPartitions", numpartitions) \
I just want to understand why there is a difference between the two datasets when the data is the same and the code is also the same.

Getting error: Py4JJavaError while writing data from Databricks to sql data-warehouse

I followed this link on writing data from Databricks to a SQL data warehouse.
datafram.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver.......") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "table") \
    .option("tempDir", "Blob_url") \
    .save()
but still I am getting this error:
Py4JJavaError: An error occurred while calling o174.save.
: java.lang.ClassNotFoundException
Please follow the steps below:
Configure Azure Storage account access key with Databricks
spark.conf.set(
    "fs.azure.account.key.<storage_account>.blob.core.windows.net",
    "Azure_access_key")
Syntax of the JDBC URL:
jdbc_url = "jdbc:sqlserver://<Server_name>.sql.azuresynapse.net:1433;database=<Database_name>;user=<user_name>;password=<password>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
Writing data from Azure Databricks to Synapse:
df.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", jdbc_url) \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "<table_name>") \
    .option("tempDir", "wasbs://sdd@vambl.blob.core.windows.net/") \
    .mode("overwrite") \
    .save()

How to exceed limit 256 length on Databricks

I am working with Databricks. I created a function where I use try/catch to catch any error messages. Unfortunately, for error messages longer than 256 characters, I cannot write them to my target table.
def writeIntoSynapseLogTable(df, mode):
    df.write \
        .format("com.databricks.spark.sqldw") \
        .option("url", jdbc_string) \
        .option("tempDir", f"""mytempDir""") \
        .option("useAzureMSI", "true") \
        .option("dbTable", f"""mytable""") \
        .options(nullValue="null") \
        .mode(mode).save()
I assume that there is a limit imposed by PolyBase, but I would like to understand if there is an .option() to solve this writing problem.
An example of my error:
An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.cp. : java.io.FileNotFoundException: Operation failed: "The specified filesystem does not exist.", 404, HEAD, https://xxxxxx.xxx.core.windows.net/xxxxxxxxxxxxx/myfile.csv?upn=false&action=getStatus&timeout=90 at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1179) at shaded.databricks.azurebfs.org.apach...
The solution is to add this option:
.option("maxStrLength", "4000")
I hope this is helpful for someone.
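For clarity, here is the write function from the question with that option added; this is just a sketch assuming jdbc_string, the temp dir, and the table name stay exactly as in the original function.

def writeIntoSynapseLogTable(df, mode):
    df.write \
        .format("com.databricks.spark.sqldw") \
        .option("url", jdbc_string) \
        .option("tempDir", f"""mytempDir""") \
        .option("useAzureMSI", "true") \
        .option("dbTable", f"""mytable""") \
        .option("maxStrLength", "4000") \
        .options(nullValue="null") \
        .mode(mode).save()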

Spark write Dataframe to SQL Server Table with Insert Identity On

I have a Spark Dataframe that I want to push to a SQL table on a remote server. The table has an Id column that is set as an identity column. The Dataframe I want to push also has an Id column, and I want to use those Ids in the SQL table without removing the identity option for the column.
I write the dataframe like this:
df.write.format("jdbc") \
.mode(mode) \
.option("url", jdbc_url) \
.option("dbtable", table_name) \
.option("user", jdbc_username) \
.option("password", jdbc_password) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.save()
But I get the following response:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage 41.0 (TID 41, 10.1.0.4, executor 0): java.sql.BatchUpdateException: Cannot insert explicit value for identity column in table 'Table' when IDENTITY_INSERT is set to OFF.
I have tried to add a query to the writing like:
query = f"SET IDENTITY_INSERT Table ON;"
df.write.format("jdbc") \
.mode(mode) \
.option("url", jdbc_url) \
.option("query", query) \
.option("dbtable", table_name) \
.option("user", jdbc_username) \
.option("password", jdbc_password) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.save()
But that just throws an error:
IllegalArgumentException: Both 'dbtable' and 'query' can not be specified at the same time.
Or if I try to run a read with the query first:
com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'SET'.
This must be because it only supports SELECT statements.
Is it possible to do this in Spark, or would I need to use a different connector and combine turning identity insert on with regular INSERT INTO statements?
I would prefer a solution that allowed me to keep writing through the Spark context. But I am open to other solutions.
One way to work around this issue is the following:
1. Save your dataframe as a temporary table in your database.
2. Set identity insert to ON.
3. Insert the content of your temporary table into your real table.
4. Set identity insert to OFF.
5. Drop your temporary table.
Here's a pseudo code example:
tablename = "MyTable"
tmp_tablename = tablename+"tmp"
df.write.format("jdbc").options(..., dtable=tmp_tablename).save()
columns = ','.join(df.columns)
query = f"""
SET IDENTITY_INSERT {tablename} ON;
INSERT INTO {tablename} ({columns})
SELECT {columns} FROM {tmp_tablename};
SET IDENTITY_INSERT {tablename} OFF;
DROP TABLE {tmp_tablename};
"""
execute(query) # You can use Cursor from pyodbc for example to execute raw SQL queries
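A minimal sketch of that execute() helper with pyodbc (the connection string values are placeholders, not taken from the original answer):

import pyodbc

def execute(sql):
    # Placeholder connection details; SQL Server accepts the whole multi-statement batch in one call.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=<server_name>;DATABASE=<database_name>;"
        "UID=<user_name>;PWD=<password>"
    )
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        conn.commit()
    finally:
        conn.close()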

Upsert data in postgresql using spark structured streaming

I am trying to run a structured streaming application using (Py)Spark. My data is read from a Kafka topic, and then I run a windowed aggregation on event time.
# I have been able to create data frame pn_data_df after reading data from Kafka
Schema of pn_data_df:
- id: StringType
- source: StringType
- source_id: StringType
- delivered_time: TimestampType
windowed_report_df = pn_data_df.filter(pn_data_df.source == 'campaign') \
.withWatermark("delivered_time", "24 hours") \
.groupBy('source_id', window('delivered_time', '15 minute')) \
.count()
windowed_report_df = windowed_report_df \
.withColumn('start_ts', unix_timestamp(windowed_report_df.window.start)) \
.withColumn('end_ts', unix_timestamp(windowed_report_df.window.end)) \
.selectExpr('CAST(source_id as LONG)', 'start_ts', 'end_ts', 'count')
I am writing this windowed aggregation to my postgresql database which I have already created.
CREATE TABLE pn_delivery_report(
source_id bigint not null,
start_ts bigint not null,
end_ts bigint not null,
count integer not null,
unique(source_id, start_ts)
);
Writing to PostgreSQL using Spark JDBC allows me to either append or overwrite. Append mode fails if the composite key already exists in the database, and overwrite just replaces the entire table with the current batch output.
def write_pn_report_to_postgres(df, epoch_id):
    df.write \
        .mode('append') \
        .format('jdbc') \
        .option("url", "jdbc:postgresql://db_endpoint/db") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", "pn_delivery_report") \
        .option("user", "postgres") \
        .option("password", "PASSWORD") \
        .save()

windowed_report_df.writeStream \
    .foreachBatch(write_pn_report_to_postgres) \
    .option("checkpointLocation", '/home/hadoop/campaign_report_df_windowed_checkpoint') \
    .outputMode('update') \
    .start()
How can I execute a query like
INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, COUNT)
VALUES (1001, 125000000001, 125000050000, 128),
(1002, 125000000001, 125000050000, 127) ON conflict (source_id, start_ts) DO
UPDATE
SET COUNT = excluded.count;
in foreachBatch.
Spark has a JIRA feature ticket open for it, but it seems it has not been prioritised so far:
https://issues.apache.org/jira/browse/SPARK-19335
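Until something like that lands, one workaround is to run the upsert yourself inside foreachBatch, for example with psycopg2. This is only a sketch, assuming psycopg2 is installed on the cluster and treating the connection details from the question (db_endpoint, db, postgres, PASSWORD) as placeholders:

import psycopg2

def upsert_pn_report(df, epoch_id):
    # Collect the (small) windowed aggregate for this micro-batch and upsert it row by row.
    rows = df.collect()
    conn = psycopg2.connect(host="db_endpoint", dbname="db",
                            user="postgres", password="PASSWORD")
    try:
        with conn.cursor() as cur:
            for row in rows:
                cur.execute(
                    """
                    INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, count)
                    VALUES (%s, %s, %s, %s)
                    ON CONFLICT (source_id, start_ts)
                    DO UPDATE SET count = excluded.count;
                    """,
                    (row["source_id"], row["start_ts"], row["end_ts"], row["count"]),
                )
        conn.commit()
    finally:
        conn.close()

windowed_report_df.writeStream \
    .foreachBatch(upsert_pn_report) \
    .option("checkpointLocation", '/home/hadoop/campaign_report_df_windowed_checkpoint') \
    .outputMode('update') \
    .start()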
This worked for me:
def _write_streaming(df, epoch_id) -> None:
    df.write \
        .mode('append') \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://localhost:5432/postgres") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", 'table_test') \
        .option("user", 'user') \
        .option("password", 'password') \
        .save()

df_stream.writeStream \
    .foreachBatch(_write_streaming) \
    .start() \
    .awaitTermination()
You need to add ".awaitTermination()" at the end.
