Spark submit with oozie - apache-spark

I have written the following Spark code:
customDF.registerTempTable("customTable")
var query = "select date, id ts,errorode from customTable"
val finalCustomDF = hiveContext.sql(query)
finalCustomDF.write.format("com.databricks.spark.csv").save("/user/oozie/data")
When I run this code using spark-submit it runs fine, but when I run it using an Oozie coordinator I get the following exception:
User class threw exception: org.apache.spark.sql.AnalysisException: character '<EOF>' not supported here; line 1 pos 111
I have tried reading data from an existing Hive table and that works; the issue is only with customTable.

Related

Error writing data to Bigquery using Databricks Pyspark

I run a daily job to write data to BigQuery using Databricks PySpark. A recent update to the Databricks configuration (https://docs.databricks.com/data/data-sources/google/bigquery.html) caused the job to fail. I followed all the steps in the docs. Reading data works again, but writing throws the following error:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS not found
I also tried adding the configuration directly in the code (as advised for similar errors in Spark), but it did not help:
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "<path-to-key.json>")
My code is:
upload_table_dataset = 'testing_dataset'
upload_table_name = 'testing_table'
upload_table = upload_table_dataset + '.' + upload_table_name
(import_df.write.format('bigquery')
    .mode('overwrite')
    .option('project', 'xxxxx-test-project')
    .option('parentProject', 'xxxxx-test-project')
    .option('temporaryGcsBucket', 'xxxxx-testing-bucket')
    .option('table', upload_table)
    .save()
)
You need to install the GCS connector on your cluster first.
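The question uses PySpark, but the missing class lives at the Hadoop layer, so the point is the same either way: the GCS connector jar (for example the shaded com.google.cloud.bigdataoss:gcs-connector artifact) has to be attached to the cluster before the fs.gs settings can resolve anything. A minimal Scala sketch of the corresponding Hadoop configuration, assuming the connector library is already attached and using a placeholder key path:
val hadoopConf = spark.sparkContext.hadoopConfiguration

// FileSystem and AbstractFileSystem implementations for the gs:// scheme;
// GoogleHadoopFS is the class reported as missing in the error above
hadoopConf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoopConf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

// Service-account auth, as in the question; the key path is a placeholder
hadoopConf.set("fs.gs.auth.service.account.enable", "true")
hadoopConf.set("google.cloud.auth.service.account.json.keyfile", "<path-to-key.json>")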

How to handle exceptions in azure databricks notebooks?

I am new to Azure and Spark and would appreciate your help writing exception handling code for the scenario below.
I have written HQL scripts (say hql1, hql2, hql3) in 3 different notebooks and call them all from one master notebook (hql-master) as:
val df_tab1 = runQueryForTable("hql1", spark)
val df_tab2 = runQueryForTable("hql2", spark)
Now the output of the HQL scripts is stored as dataframes, and I have to write exception handling in the master notebook: if the master notebook successfully executes all the dataframes (df_tab1, df_tab2), a success status should be inserted into the Synapse table job_status.
Otherwise, if there is any error/exception during the execution of the master notebook or a dataframe, that error message should be captured and a failure status should be inserted into the Synapse table.
I already have the INSERT scripts for the success/failure messages. It would be really helpful if you could provide a sample code snippet showing how the exception handling part can be achieved. Thank you!!
Basically, it's just a simple try/except block, something like this:
results = {}
were_errors = False

for script_name in ['script1', 'script2', 'script3']:
    try:
        # dbutils.notebook.run requires a timeout in seconds; 300 is just an example
        retValue = dbutils.notebook.run(script_name, 300)
        results[script_name] = retValue
    except Exception as e:
        results[script_name] = f"Error: {e}"
        were_errors = True

if were_errors:
    # log failure here, e.g. run your failure INSERT; results holds the per-notebook error messages
    pass
else:
    # log success here, e.g. run your success INSERT
    pass
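Since the master notebook in the question is Scala, here is a sketch of the same pattern with scala.util.Try; runQueryForTable and the actual success/failure INSERT statements are the asker's own and are only indicated by comments:
import scala.util.{Try, Failure}

// Run each HQL notebook query, remembering which ones failed
val queries = Seq("hql1", "hql2", "hql3")
val results = queries.map(name => name -> Try(runQueryForTable(name, spark))).toMap

// Collect the error messages of any failed runs
val errors = results.collect { case (name, Failure(e)) => s"$name: ${e.getMessage}" }

if (errors.isEmpty) {
  // run the success INSERT into the job_status table here
} else {
  // run the failure INSERT here, e.g. with errors.mkString("; ") as the message
}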

How to work with temporary tables in foreachBatch?

We are building a streaming platform where it is essential to work with SQL in batches.
val query = streamingDataSet.writeStream
  .option("checkpointLocation", checkPointLocation)
  .foreachBatch { (df, batchId) => {
    df.createOrReplaceTempView("events")
    val df1 = ExecutionContext.getSparkSession.sql("select * from events")
    df1.limit(5).show()
    // More complex processing on dataframes
  }}
  .trigger(trigger)
  .outputMode(outputMode)
  .start()
query.awaitTermination()
The error thrown is:
org.apache.spark.sql.streaming.StreamingQueryException: Table or view not found: events
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'events' not found in database 'default';
The streaming source is Kafka with watermarking, and without using Spark SQL we are able to execute the dataframe transformations. The Spark version is 2.4.0 and Scala is 2.11.7. The trigger is ProcessingTime every 1 minute and the output mode is Append.
Is there any other approach to facilitate the use of Spark SQL within foreachBatch? Would it work with an upgraded version of Spark, and if so, to which version should we upgrade?
Kindly help. Thank you.
tl;dr Replace ExecutionContext.getSparkSession with df.sparkSession.
The reason for the StreamingQueryException is that the streaming query tries to access the events temporary table through a SparkSession that knows nothing about it, i.e. ExecutionContext.getSparkSession.
The only SparkSession that has this events temporary table registered is exactly the SparkSession the df dataframe is created within, i.e. df.sparkSession.
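Applied to the snippet from the question, the fix is a one-line change (a sketch only; streamingDataSet, checkPointLocation, trigger and outputMode are the asker's values):
val query = streamingDataSet.writeStream
  .option("checkpointLocation", checkPointLocation)
  .foreachBatch { (df, batchId) =>
    df.createOrReplaceTempView("events")
    // Use the session the micro-batch DataFrame belongs to,
    // not ExecutionContext.getSparkSession
    val df1 = df.sparkSession.sql("select * from events")
    df1.limit(5).show()
  }
  .trigger(trigger)
  .outputMode(outputMode)
  .start()

query.awaitTermination()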
Please check the code snippet below. Here, I have created two separate DataFrames, responseDF1 and responseDF2 from resultDF and shown the output in the console. responseDF2 is created using a temporary table. You can try the same.
resultDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()

  val responseDF1 = batchDF.selectExpr("ResponseObj.type", "ResponseObj.key", "ResponseObj.activity", "ResponseObj.price")
  responseDF1.show()
  responseDF1.createTempView("responseTbl1")

  val responseDF2 = batchDF.sparkSession.sql("select activity, key from responseTbl1")
  responseDF2.show()

  batchDF.sparkSession.catalog.dropTempView("responseTbl1")
  batchDF.unpersist()
  ()
}.start().awaitTermination()

spark-sql Table or view not found error

I'm trying to run a basic Java program using Spark SQL and JDBC, and I'm running into the following error. I'm not sure what's wrong here; most of the material I have read does not cover what needs to be done to fix this problem.
It would also be great if someone could point me to some good material on Spark SQL (Spark 2.1.1). I'm planning to use Spark to implement ETLs, connecting to MySQL and other data sources.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: myschema.mytable; line 1 pos 21;
String MYSQL_CONNECTION_URL = "jdbc:mysql://localhost:3306/myschema";
String MYSQL_USERNAME = "root";
String MYSQL_PWD = "root";
Properties connectionProperties = new Properties();
connectionProperties.put("user", MYSQL_USERNAME);
connectionProperties.put("password", MYSQL_PWD);
Dataset<Row> jdbcDF2 = spark.read()
    .jdbc(MYSQL_CONNECTION_URL, "myschema.mytable", connectionProperties);
spark.sql("SELECT COUNT(*) FROM myschema.mytable").show();
That's because Spark does not register any tables from the JDBC connection's schemas in the Spark SQL context by default. You must register the DataFrame yourself:
jdbcDF2.createOrReplaceTempView("mytable");
spark.sql("SELECT COUNT(*) FROM mytable").show();
Your jdbcDF2 has its source in the MySQL table myschema.mytable and will load data from that table when an action is executed.
Remember that a MySQL table is not the same as a Spark table or view. You are telling Spark to read data from MySQL, but you must register the resulting DataFrame or Dataset as a table or view in the current Spark SQL context or Spark session before you can query it with spark.sql.

Connecting to Hive using Spark-SQL

I am running hive queries using Spark-SQL.
I created a HiveContext object:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
Then when I try to run the command:
hiveContext.sql("use db_name");
OR
hiveContext.hiveql("use db_name");
It doesn't work. It says the database was not found.
When I try to run
val db = hiveContext.hiveql("show databases");
db.collect.foreach(println);
It prints only [default].
Any help would be appreciated.
hiveContext.sql("SELECT * FROM database.table")
