How to handle exceptions in Azure Databricks notebooks? - apache-spark

I am new to Azure and Spark and request your help on writing the exception handling code for the below scenario.
I have written HQL scripts (say hql1, hql2, hql3) in 3 different notebooks and am calling them all from one master notebook (hql-master) as:
val df_tab1 = runQueryForTable("hql1", spark)
val df_tab2 = runQueryForTable("hql2", spark)
Now I have the output of the HQL scripts stored as dataframes, and I have to write exception handling in the master notebook so that if the master notebook has successfully executed all the dataframes (df_tab1, df_tab2), a success status is inserted into the Synapse table job_status.
Else, if there was any error/exception during the execution of the master notebook/dataframes, that error message should be captured and a failure status inserted into the Synapse table.
I already have the INSERT scripts for the success/failure message insert. It would be really helpful if you could provide a sample code snippet through which the exception handling part can be achieved. Thank you!!

Basically, it's just simple try/except code, something like this:
results = {}
were_errors = False
for script_name in ['script1', 'script2', 'script3']:
    try:
        # dbutils.notebook.run needs a timeout (in seconds) as its second argument
        retValue = dbutils.notebook.run(script_name, 3600)
        results[script_name] = retValue
    except Exception as e:
        results[script_name] = f"Error: {e}"
        were_errors = True

if were_errors:
    # log failure into the Synapse job_status table - you can use data from the results variable
    pass
else:
    # log success into the Synapse job_status table
    pass
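You mentioned you already have the INSERT scripts for the status messages; if it helps, here is a minimal sketch of how the success/failure write into the Synapse job_status table could be wired up with the Databricks Synapse connector. The connection URL, tempDir, column names and the log_job_status helper below are placeholders/assumptions, not your actual scripts:

from datetime import datetime

def log_job_status(status, message=""):
    # one status row: timestamp, SUCCESS/FAILURE, optional error text
    status_df = spark.createDataFrame(
        [(datetime.now(), status, message)],
        ["run_timestamp", "status", "error_message"],
    )
    (status_df.write
        .format("com.databricks.spark.sqldw")  # Azure Synapse connector
        .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
        .option("tempDir", "abfss://<container>@<account>.dfs.core.windows.net/tmp")
        .option("forwardSparkAzureStorageCredentials", "true")
        .option("dbTable", "job_status")
        .mode("append")
        .save())

if were_errors:
    log_job_status("FAILURE", str(results))
else:
    log_job_status("SUCCESS")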

Related

Error writing data to Bigquery using Databricks Pyspark

I run a daily job to write data to BigQuery using Databricks Pyspark. There was a recent update of configuration for Databricks (https://docs.databricks.com/data/data-sources/google/bigquery.html) which caused the job to fail. I followed all the steps in the docs. Reading data works again but writing throws the following error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS not found
I also tried adding the configuration right in the code (as advised for similar errors in Spark), but it did not help:
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "<path-to-key.json>")
My code is:
upload_table_dataset = 'testing_dataset'
upload_table_name = 'testing_table'
upload_table = upload_table_dataset + '.' + upload_table_name
(import_df.write.format('bigquery')
    .mode('overwrite')
    .option('project', 'xxxxx-test-project')
    .option('parentProject', 'xxxxx-test-project')
    .option('temporaryGcsBucket', 'xxxxx-testing-bucket')
    .option('table', upload_table)
    .save()
)
You need to install the GCS connector on your cluster first.
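If it helps, here is a hedged sketch of one way to do that: installing the connector as a Maven library on the cluster through the Databricks Libraries API (the same thing can be done from the cluster UI under Libraries > Install new > Maven). The workspace URL, token, cluster id and connector version below are placeholders/assumptions; pick a connector version matching your DBR/Hadoop version:

import requests

host = "https://<your-workspace>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                      # placeholder
cluster_id = "<cluster-id>"                            # placeholder

resp = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_id": cluster_id,
        "libraries": [
            # GCS connector that provides the com.google.cloud.hadoop.fs.gcs.* classes
            {"maven": {"coordinates": "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.11"}}
        ],
    },
)
resp.raise_for_status()

Some setups instead use a cluster init script that places the shaded gcs-connector jar on the cluster; either way, the point is that the GoogleHadoopFS classes must be on the cluster classpath before the write.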

Chaining Delta Streams programmatically raising AnalysisException

Situation: I am producing a delta folder with data from a previous streaming query A, and later reading from it into another DF, as shown here:
DF_OUT.writeStream.format("delta").(...).start("path")
(...)
DF_IN = spark.readStream.format("delta").load("path")
1 - When I try to read it this way in a subsequent readStream (chaining queries for an ETL pipeline) from the same program, I end up getting the exception below.
2 - When I run it in the Scala REPL, however, it runs smoothly.
Not sure what is happening there, but it sure is puzzling.
org.apache.spark.sql.AnalysisException: Table schema is not set. Write data into it or use CREATE TABLE to set the schema.;
at org.apache.spark.sql.delta.DeltaErrors$.schemaNotSetException(DeltaErrors.scala:365)
at org.apache.spark.sql.delta.sources.DeltaDataSource.sourceSchema(DeltaDataSource.scala:74)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:209)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:95)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:95)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:171)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:225)
at org.apache.spark.ui.DeltaPipeline$.main(DeltaPipeline.scala:114)
From the Delta Lake Quick Guide - Troubleshooting:
Table schema is not set error
Problem:
When the path of the Delta table does not exist and you try to stream data from it, you will get the following error.
org.apache.spark.sql.AnalysisException: Table schema is not set. Write data into it or use CREATE TABLE to set the schema.;
Solution:
Make sure the path of a Delta table is created.
After reading the error message, I did try to be a good boy and follow the advice, so I made sure there actually IS valid data in the delta folder I am trying to read from BEFORE calling the readStream, and voilà!
import java.io.File

def hasFiles(dir: String): Boolean = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles.filter(_.isFile).size > 0
  } else false
}

DF_OUT.writeStream.format("delta").(...).start(DELTA_DIR)

while (!hasFiles(DELTA_DIR)) {
  print("DELTA FOLDER STILL EMPTY")
  Thread.sleep(10000)
}
print("FOUND DATA ON DELTA A - WAITING 30 SEC")
Thread.sleep(30000)
DF_IN = spark.readStream.format("delta").load(DELTA_DIR)
It ended up working, but I had to make sure to wait long enough for "something to happen" (don't know what exactly TBH, but it seems that reading from delta needs some writes to be complete - maybe metadata?).
However, this is still a hack. I wish it were possible to start reading from an empty delta folder and wait for content to start pouring into it.
For me, since I couldn't find the absolute path, a simple solution was using this alternative:
spark.readStream.format("delta").table("tableName")
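For reference, here is a minimal PySpark sketch of that table-based alternative, assuming the upstream stream writes to a hypothetical metastore table events_delta instead of a raw path (toTable and readStream.table require Spark 3.1+):

(df_out.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events_delta")  # placeholder path
    .toTable("events_delta"))  # registers/writes the stream as a metastore table

# downstream query reads by table name, so no path lookup is needed
df_in = spark.readStream.format("delta").table("events_delta")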

PicklingError in Pyspark

I have written the below function in PySpark to take a deptid and return a dataframe, which I want to use in Spark SQL.
def get_max_salary(deptid):
    sql_salary = "select max(salary) from empoyee where depid ={}"
    df_salary = spark.sql(sql_salary.format(deptid))
    return df_salary

spark.udf.register('get_max_salary', get_max_salary)
However, I get the below error message. I searched online but couldn't find a proper solution anywhere. Could someone please help me here?
Error Message - PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
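The error text already hints at the cause: spark.sql (and with it the SparkSession/SparkContext) can only be used on the driver, but a registered UDF is serialized and run on the workers, so it cannot carry a reference to spark. A minimal sketch of a driver-side alternative, keeping the names from the snippet above and skipping the UDF registration (the department id used in the call is hypothetical):

def get_max_salary(deptid):
    # runs on the driver, so calling spark.sql here is fine
    return spark.sql("select max(salary) from empoyee where depid ={}".format(deptid))

df_salary = get_max_salary(10)  # hypothetical department id

# or compute the max salary for every department in one pass and join/filter as needed
max_by_dept = spark.sql("select depid, max(salary) as max_salary from empoyee group by depid")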

Google Cloud Bigquery Library Error

I am receiving this error
Cannot set destination table in jobs with DDL statements
When I try to resubmit a job from the job.build_resource() function in the google.cloud.bigquery library.
It seems that the destination table is set to something like this after that function call.
'destinationTable': {'projectId': 'xxx', 'datasetId': 'xxx', 'tableId': 'xxx'},
Am I doing something wrong here? Thanks to anyone that can give me any guidance here.
EDIT:
The job is initially being triggered by this
query = bq.query(sql_rendered)
We store the job id and use it later to check the status.
We get the job like this
job = bq.get_job(job_id=job_id)
If it meets a condition - in this case, it failed due to rate limiting - we retry the job.
We retry the job like this
di = job._build_resource()
jo = bigquery.Client(project=self.project_client).job_from_resource(di)
jo._begin()
I think that's pretty much all of the code you need, but happy to provide more if needed.
You are seeing this error because you have a DDL statement in your query. What is happening is that the job_config is changing some values after the execution of the first query, particularly job_config.destination. In order to overcome this issue, you could try to reset the value of job_config.destination to None after each job submission, or use a different job_config for every query.
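A short sketch of both options, assuming the bq client and sql_rendered from the question:

from google.cloud import bigquery

job_config = bigquery.QueryJobConfig()
query_job = bq.query(sql_rendered, job_config=job_config)
query_job.result()

# option 1: clear the destination before reusing the same config for a DDL statement
job_config.destination = None

# option 2: build a fresh config for every query so nothing carries over
next_job = bq.query(sql_rendered, job_config=bigquery.QueryJobConfig())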

Spark submit with oozie

I have written some Spark code:
customDF.registerTempTable("customTable")
var query = "select date, id ts,errorode from customTable"
val finalCustomDF = hiveContext.sql(query)
finalCustomDF.write.format("com.databricks.spark.csv").save("/user/oozie/data")
When I run this code using spark-submit, it runs fine, but when I run it using an Oozie coordinator, I get the following exception.
User class threw exception: org.apache.spark.sql.AnalysisException: character '<EOF>' not supported here; line 1 pos 111
I have tried reading data from an existing Hive table and it works; the issue is only with customTable.
