I am working with Databricks. I created a function where I use a try/except block to catch any error messages. Unfortunately, for errors longer than 256 characters I cannot write to my target table.
def writeIntoSynapseLogTable(df, mode):
    df.write \
        .format("com.databricks.spark.sqldw") \
        .option("url", jdbc_string) \
        .option("tempDir", f"""mytempDir """) \
        .option("useAzureMSI", "true") \
        .option("dbTable", f"""mytable """) \
        .options(nullValue="null") \
        .mode(mode).save()
I assume there is a limit imposed by PolyBase, but I would like to understand whether there is an .option() that solves this write problem.
An example of my error:
An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.cp. : java.io.FileNotFoundException: Operation failed: "The specified filesystem does not exist.", 404, HEAD, https://xxxxxx.xxx.core.windows.net/xxxxxxxxxxxxx/myfile.csv?upn=false&action=getStatus&timeout=90 at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1179) at shaded.databricks.azurebfs.org.apach...
The solution is to add this option:
.option("maxStrLength", "4000")
I hope this helps someone.
I followed this link: writing data from Databricks to SQL Data Warehouse.
dataframe.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver.......") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "table") \
    .option("tempDir", "Blob_url") \
    .save()
but I am still getting this error:
Py4JJavaError: An error occurred while calling o174.save.
: java.lang.ClassNotFoundException
Please follow the steps below:
Configure Azure Storage account access key with Databricks
spark.conf.set(
    "fs.azure.account.key.<storage_account>.blob.core.windows.net", "Azure_access_key")
Syntax of JDBC URL
jdbc_url = "jdbc:sqlserver://<Server_name>.sql.azuresynapse.net:1433;database=<Database_name>;user=<user_name>;password=<password>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
Sample Data frame:
Writing data from Azure Databricks to Synapse:
df.write \
.format("com.databricks.spark.sqldw") \
.option("url", jdbc_url) \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "<table_name>") \
.option("tempDir", "wasbs://sdd#vambl.blob.core.windows.net/") \
.mode("overwrite") \
.save()
Output:
I am trying to read data in Delta format from ADLS. I want to read only a portion of that data by filtering in place. The same approach worked for me when reading the JDBC format:
query = f"""
select * from {table_name}
where
createdate < to_date('{createdate}','YYYY-MM-DD HH24:MI:SS') or
modifieddate < to_date('{modifieddate}','YYYY-MM-DD HH24:MI:SS')
"""
return spark.read \
.format("jdbc") \
.option("url", url) \
.option("query", query) \
.option("user", username) \
.option("password", password) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.load()
So I tried to read Delta in a similar way using a query, but it reads the whole table.
return spark.read \
.format("delta") \
.option("query", query) \
.load(path)
How can I solve this issue without reading the full dataframe and then filtering it?
Thanks in advance!
Spark uses a feature called predicate pushdown to optimize queries.
In the first case, the filters can be passed on to the Oracle database.
Delta does not work that way: the "query" option is not supported by the Delta reader, so it is ignored. There can be optimisations through data skipping and Z-ordering, but since you are essentially querying Parquet files, you have to read all of them into memory and filter afterwards.
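In practice, the way to express that filter for Delta is to load the path and apply .where() on the DataFrame rather than a query option; Spark applies the predicate during the scan, and Delta's file statistics can prune files where possible. A sketch reusing the variables from the question (note that Spark's to_timestamp uses a different format pattern than Oracle's to_date; the pattern below is an assumption about the input strings):
from pyspark.sql import functions as F

def read_delta_filtered(spark, path, createdate, modifieddate):
    # Load the Delta table lazily; the .where() predicate is applied during the
    # scan, so Delta's file-level statistics can skip files where possible.
    return (
        spark.read.format("delta").load(path)
        .where(
            (F.col("createdate") < F.to_timestamp(F.lit(createdate), "yyyy-MM-dd HH:mm:ss"))
            | (F.col("modifieddate") < F.to_timestamp(F.lit(modifieddate), "yyyy-MM-dd HH:mm:ss"))
        )
    )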
I am trying to write a Spark dataframe into an Azure Synapse database.
My code:
try:
    re_spdf.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("append") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .option("encrypt", 'True') \
        .option("trustServerCertificate", 'false') \
        .option("hostNameInCertificate", '*.database.windows.net') \
        .option("mssqlIsolationLevel", "READ_UNCOMMITTED") \
        .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
        .save()
except ValueError as error:
    print("Connector write failed", error)
Error message:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 29.0 failed 4 times, most recent failure:
Lost task 1.3 in stage 29.0 (TID 885, 10.139.64.8, executor 0):
com.microsoft.sqlserver.jdbc.SQLServerException:
PdwManagedToNativeInteropException ErrorNumber: 46724, MajorCode: 467,
MinorCode: 24, Severity: 20, State: 2, Exception of type
'Microsoft.SqlServer.DataWarehouse.Tds.PdwManagedToNativeInteropException' was thrown.
I even googled this error message, but I didn't find any useful solution.
Update: my working environment is a Databricks PySpark notebook.
Any suggestions would be appreciated.
There is a column length limitation in the Synapse DB table: it allows only 4000 characters.
So when I use com.databricks.spark.sqldw, which uses PolyBase as the connector, I need to change the length of the column in the DB table as well.
Reference: https://forums.databricks.com/questions/21032/databricks-throwing-error-sql-dw-failed-to-execute.html
code:
df.write \
.format("com.databricks.spark.sqldw") \
.mode("append") \
.option("url", url) \
.option("user", username) \
.option("password", password) \
.option("maxStrLength", "4000" ) \
.option("tempDir", "tempdirdetails") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("mssqlIsolationLevel", "READ_UNCOMMITTED") \
.option("dbTable", table_name) \
.save()
The Azure Databricks documentation says to use the format com.databricks.spark.sqldw to read data from and write data to an Azure Synapse table.
If you are using Synapse, why not use Synapse notebooks? Then writing the dataframe is as easy as calling synapsesql, e.g.
%%spark
df.write.synapsesql("yourPool.dbo.someXMLTable_processed", Constants.INTERNAL)
You would save yourself some trouble and performance should be good as it's parallelised. This is the main article:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export
I'm performing an aggregation on a streaming dataframe and trying to write the result to an output directory. But I'm getting an exception saying
pyspark.sql.utils.AnalysisException: 'Data source json does not support Update output mode;
I'm getting a similar error with the "complete" output mode.
This is my code:
grouped_df = logs_df.groupBy('host', 'timestamp').agg(count('host').alias('total_count'))
result_host = grouped_df.filter(col('total_count') > threshold)
writer_query = result_host.writeStream \
.format("json") \
.queryName("JSON Writer") \
.outputMode("update") \
.option("path", "output") \
.option("checkpointLocation", "chk-point-dir") \
.trigger(processingTime="1 minute") \
.start()
writer_query.awaitTermination()
File sinks only support "append" output mode, according to the Structured Streaming documentation on output sinks; see "Supported Output Modes" in the sinks table there.
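A sketch of the query from the question rewritten for append mode; note that an aggregation in append mode also needs a watermark on the event-time column (here assumed to be timestamp) so the engine knows when a group can be finalized and written out:
from pyspark.sql.functions import col, count

# The watermark on the event-time column lets append mode emit finalized groups.
grouped_df = logs_df \
    .withWatermark("timestamp", "5 minutes") \
    .groupBy('host', 'timestamp') \
    .agg(count('host').alias('total_count'))

result_host = grouped_df.filter(col('total_count') > threshold)

writer_query = result_host.writeStream \
    .format("json") \
    .queryName("JSON Writer") \
    .outputMode("append") \
    .option("path", "output") \
    .option("checkpointLocation", "chk-point-dir") \
    .trigger(processingTime="1 minute") \
    .start()

writer_query.awaitTermination()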
I am using foreachBatch in PySpark Structured Streaming to write each microbatch to SQL Server using JDBC. I need to use the same process for several tables, and I'd like to reuse the same writer function by adding an additional argument for the table name, but I'm not sure how to pass the table name argument.
The example here is pretty helpful, but in the Python example the table name is hardcoded, and it looks like in the Scala example they're referencing a global variable(?). I would like to pass the name of the table into the function.
The function given in the Python example at the link above is:
def writeToSQLWarehose(df, epochId):
    df.write \
        .format("com.databricks.spark.sqldw") \
        .mode('overwrite') \
        .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
        .option("forward_spark_azure_storage_credentials", "true") \
        .option("dbtable", "my_table_in_dw_copy") \
        .option("tempdir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
        .save()
I'd like to use something like this:
def writeToSQLWarehose(df, epochId, tableName):
    df.write \
        .format("com.databricks.spark.sqldw") \
        .mode('overwrite') \
        .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
        .option("forward_spark_azure_storage_credentials", "true") \
        .option("dbtable", tableName) \
        .option("tempdir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
        .save()
But I'm not sure how to pass the additional argument through foreachBatch.
Something like this should work.
streamingDF.writeStream.foreachBatch(lambda df, epochId: writeToSQLWarehose(df, epochId, tableName)).start()
Samellas' solution does not work if you need to run multiple streams. The foreachBatch function gets serialised and sent to the Spark worker. The parameter still seems to be a shared variable within the worker and may change during execution.
My solution is to add the parameter as a literal column in the batch dataframe (passing a silver data lake table path to the merge operation):
.withColumn("dl_tablePath", func.lit(silverPath))
.writeStream.format("delta")
.foreachBatch(insertIfNotExisting)
In the batch function insertIfNotExisting, I pick up the parameter and drop the parameter column:
def insertIfNotExisting(batchDf, batchId):
    tablePath = batchDf.select("dl_tablePath").limit(1).collect()[0][0]
    realDf = batchDf.drop("dl_tablePath")
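Putting the pieces together, a self-contained sketch of this pattern, assuming a Databricks notebook where spark is already defined; sourceStreamDf, silverPath, the checkpoint location, and the merge on an id column are hypothetical stand-ins for the real pipeline:
from pyspark.sql import functions as func
from delta.tables import DeltaTable

def insertIfNotExisting(batchDf, batchId):
    # Recover the parameter that was passed in as a literal column, then drop it.
    tablePath = batchDf.select("dl_tablePath").limit(1).collect()[0][0]
    realDf = batchDf.drop("dl_tablePath")

    # Hypothetical merge step: insert only rows that do not already exist
    # in the target Delta table (join key "id" is an assumption).
    target = DeltaTable.forPath(spark, tablePath)
    (target.alias("t")
        .merge(realDf.alias("s"), "t.id = s.id")
        .whenNotMatchedInsertAll()
        .execute())

query = (sourceStreamDf
    .withColumn("dl_tablePath", func.lit(silverPath))   # pass the parameter as a column
    .writeStream
    .option("checkpointLocation", f"{silverPath}/_checkpoint")  # assumed location
    .foreachBatch(insertIfNotExisting)
    .start())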