Chaining Delta Streams programmatically raising AnalysisException - apache-spark

Situation: I am producing a Delta folder with data from a previous streaming query A, and later reading from it into another DataFrame, as shown here:
DF_OUT.writeStream.format("delta").(...).start("path")
(...)
DF_IN = spark.readStream.format("delta").load("path")
1 - When I try to read it this way in a subsequent readStream (chaining queries for an ETL pipeline) from the same program, I end up with the exception below.
2 - When I run it in the Scala REPL, however, it runs smoothly.
I am not sure what is happening there, but it sure is puzzling.
org.apache.spark.sql.AnalysisException: Table schema is not set. Write data into it or use CREATE TABLE to set the schema.;
at org.apache.spark.sql.delta.DeltaErrors$.schemaNotSetException(DeltaErrors.scala:365)
at org.apache.spark.sql.delta.sources.DeltaDataSource.sourceSchema(DeltaDataSource.scala:74)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:209)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:95)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:95)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:171)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:225)
at org.apache.spark.ui.DeltaPipeline$.main(DeltaPipeline.scala:114)

From the Delta Lake Quick Guide - Troubleshooting:
Table schema is not set error
Problem:
When the path of the Delta table does not exist and you try to stream data from it, you will get the following error.
org.apache.spark.sql.AnalysisException: Table schema is not set. Write data into it or use CREATE TABLE to set the schema.;
Solution:
Make sure the path of a Delta table is created.

After reading the error message, I did try to be a good boy and follow the advice, so I made sure there actually IS valid data in the Delta folder I am trying to read from BEFORE calling readStream, and voilà!
import java.io.File

def hasFiles(dir: String): Boolean = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles.filter(_.isFile).size > 0
  } else false
}
DF_OUT.writeStream.format("delta").(...).start(DELTA_DIR)

while (!hasFiles(DELTA_DIR)) {
  print("DELTA FOLDER STILL EMPTY")
  Thread.sleep(10000)
}
print("FOUND DATA ON DELTA A - WAITING 30 SEC")
Thread.sleep(30000)

DF_IN = spark.readStream.format("delta").load(DELTA_DIR)
It ended up working, but I had to make sure to wait long enough for "something to happen" (I don't know what exactly, TBH, but it seems that reading from Delta needs some writes to be complete first - maybe the metadata?).
However, this is still a hack. I hope it is possible to start reading from an empty Delta folder and wait for content to start pouring into it.
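One less hacky option, I believe, is to make sure the Delta log (and therefore the schema) exists before the stream reader starts, by doing a one-off batch write of an empty DataFrame with the expected schema to the same path. The sketch below assumes that approach; the column names are placeholders and must match whatever DF_OUT actually produces:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

// Hypothetical schema - replace with the real schema of DF_OUT
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("event_time", TimestampType)
))

// Writing an empty batch DataFrame once creates the _delta_log with the schema,
// so a later readStream can attach to the still-empty table.
spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  .write.format("delta").mode("append").save(DELTA_DIR)

val DF_IN = spark.readStream.format("delta").load(DELTA_DIR)
With the schema already committed to the Delta log, the readStream should no longer hit the "Table schema is not set" error, even while the folder has no data files yet.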

In my case I couldn't find the absolute path, so a simple solution was to use this alternative:
spark.readStream.format("delta").table("tableName")

Related

Incremental data load from Redshift to S3 using PySpark and Glue Jobs

I have created a pipeline where the data ingestion takes place between Redshift and S3. I was able to do the complete load using the below method:
def readFromRedShift(spark: SparkSession, schema, tablename):
    table = str(schema) + "." + str(tablename)
    (url, Properties, host, port, db) = con.getConnection("REDSHIFT")
    df = spark.read.jdbc(url=url, table=table, properties=Properties)
    return df
Where getConnection is a different method under a separate class that handles all the Redshift-related details. Later on, I used this method to create a DataFrame and wrote the results into S3, which worked like a charm.
Now, I want to load the incremental data. Will enabling the Job Bookmarks Glue option help me? Or is there any other way to do it? I followed the official documentation, but it was of no help for my problem statement. So, the first run will load the complete data; if I rerun the job, will it be able to load only the newly arrived records?
You are right. It can be achieved via job bookmarks, but at the same time it can be a bit tricky.
Please refer to this doc: https://aws.amazon.com/blogs/big-data/load-data-incrementally-and-optimized-parquet-writer-with-aws-glue/

How to handle exceptions in Azure Databricks notebooks?

I am new to Azure and Spark and request your help on writing the exception handling code for the below scenario.
I have written HQL scripts (say hql1, hql2, hql3) in 3 different notebooks, and I am calling them all from one master notebook (hql-master), as:
val df_tab1 = runQueryForTable("hql1", spark)
val df_tab2 = runQueryForTable("hql2", spark)
Now I have the output of the HQL scripts stored as DataFrames, and I have to write exception handling in the master notebook: if the master notebook has successfully executed all the DataFrames (df_tab1, df_tab2), a success status should get inserted into the Synapse table job_status.
If instead there was any error/exception during the execution of the master notebook/DataFrames, that error message should be captured and a failure status should get inserted into the Synapse table.
I already have the INSERT scripts for the success/failure message insert. It would be really helpful if you could provide a sample code snippet showing how the exception handling part can be achieved. Thank you!
Basically, it's just simple try/except code, something like this:
results = {}
were_errors = False

for script_name in ['script1', 'script2', 'script3']:
    try:
        # the second argument is the notebook timeout in seconds (600 here is arbitrary)
        retValue = dbutils.notebook.run(script_name, 600)
        results[script_name] = retValue
    except Exception as e:
        results[script_name] = f"Error: {e}"
        were_errors = True

if were_errors:
    # insert the failure status into the Synapse table here - you can use the data in results
    pass
else:
    # insert the success status into the Synapse table here
    pass

Google Cloud BigQuery Library Error

I am receiving this error:
Cannot set destination table in jobs with DDL statements
when I try to resubmit a job from the job._build_resource() function in the google.cloud.bigquery library.
It seems that the destination table is set to something like this after that function call:
'destinationTable': {'projectId': 'xxx', 'datasetId': 'xxx', 'tableId': 'xxx'},
Am I doing something wrong here? Thanks to anyone who can give me any guidance here.
EDIT:
The job is initially being triggered by this
query = bq.query(sql_rendered)
We store the job id and use it later to check the status.
We get the job like this
job = bq.get_job(job_id=job_id)
If it meets a condition (in this case, it failed due to rate limiting), we retry the job.
We retry the job like this
di = job._build_resource()
jo = bigquery.Client(project=self.project_client).job_from_resource(di)
jo._begin()
I think that's pretty much all of the code you need, but happy to provide more if needed.
You are seeing this error because you have a DDL statement in your query. What is happening is that the job_config changes some values after the execution of the first query, particularly job_config.destination. To overcome this issue, you could reset job_config.destination to None after each job submission, or use a different job_config for every query.

In Spark Streaming, how to process old data and delete processed data

We are running a Spark streaming job that retrieves files from a directory (using textFileStream).
One concern we are having is the case where the job is down but files are still being added to the directory.
Once the job starts up again, those files are not being picked up (since they are not new or changed while the job is running) but we would like them to be processed.
1) Is there a solution for that? Is there a way to keep track of what files have been processed, and can we "force" older files to be picked up?
2) Is there a way to delete the processed files?
The article below pretty much covers all your questions.
https://blog.yanchen.ca/2016/06/28/fileinputdstream-in-spark-streaming/
1) Is there a solution for that? Is there a way to keep track of what files have been processed, and can we "force" older files to be picked up?
The stream reader initiates its batch window using the system clock when the job/application is launched, so all files created before that point are ignored. Try enabling checkpointing (a sketch of that setup follows this answer).
2) Is there a way to delete the processed files?
Deleting files might be unnecessary. If checkpointing works, the files that have not been processed are identified by Spark. If for some reason the files must be deleted, implement a custom input format and reader (please refer to the article) to capture the file name, and use this information as appropriate. But I wouldn't recommend this approach.
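Regarding the "try enabling checkpointing" suggestion above: in the DStream API that usually means creating (or recovering) the StreamingContext through StreamingContext.getOrCreate. This is only a sketch of that setup; the paths, app name and batch interval below are placeholders:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/file-stream-job"  // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("FileStreamJob")
  val ssc = new StreamingContext(conf, Seconds(60))
  ssc.checkpoint(checkpointDir)

  val lines = ssc.textFileStream("hdfs:///input/dir")  // hypothetical input directory
  lines.foreachRDD { rdd =>
    // process the batch here
  }
  ssc
}

// getOrCreate rebuilds the context from the checkpoint if one exists,
// otherwise it calls createContext and starts fresh.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
Whether recovery actually picks up the files that arrived while the job was down still needs to be verified for your setup; the answer above only suggests checkpointing as a direction.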
Is there a way to delete the processed files?
In my experience, I couldn't get the checkpointing feature to work, so I had to delete/move the processed files that entered each batch.
The way to get those files is a bit tricky, but basically we can say that they are ancestors (dependencies) of the current RDD. What I use, then, is a recursive method that crawls that structure and recovers the names of the RDDs that begin with hdfs.
import org.apache.spark.rdd.RDD

/**
 * Recursive method to extract the original metadata files involved in this batch.
 * @param rdd Each RDD created for each batch.
 * @return All HDFS files originally read.
 */
def extractSourceHDFSFiles(rdd: RDD[_]): Set[String] = {
  def extractSourceHDFSFilesWithAcc(rdds: List[RDD[_]]): Set[String] = {
    rdds match {
      case Nil => Set()
      case head :: tail =>
        val name = head.toString()
        if (name.startsWith("hdfs")) {
          Set(name.split(" ")(0)) ++
            extractSourceHDFSFilesWithAcc(head.dependencies.map(_.rdd).toList) ++
            extractSourceHDFSFilesWithAcc(tail)
        } else {
          extractSourceHDFSFilesWithAcc(head.dependencies.map(_.rdd).toList) ++
            extractSourceHDFSFilesWithAcc(tail)
        }
    }
  }
  extractSourceHDFSFilesWithAcc(rdd.dependencies.map(_.rdd).toList)
}
So, in the foreachRDD method you can easily invoke it:
stream.foreachRDD { rdd =>
  val filesInBatch = extractSourceHDFSFiles(rdd)
  logger.info("Files to be processed:")
  // Process them
  // Delete them when you are done
}
The answer to your second question: after searching for many hours, I finally got the solution.
It is now possible in Spark 3 - you can use the "cleanSource" option for readStream.
Thanks to the documentation https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html and this video https://www.youtube.com/watch?v=EM7T34Uu2Gg.
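For illustration, this is roughly what that option looks like on a Structured Streaming file source (the format, input path and archive directory below are made up). Per the Spark documentation, cleanSource accepts "archive", "delete" or the default "off", and "archive" additionally requires sourceArchiveDir:
val dfIn = spark.readStream
  .format("text")
  .option("cleanSource", "archive")            // or "delete" to remove processed files
  .option("sourceArchiveDir", "/data/archive") // only needed when cleanSource = "archive"
  .load("/data/incoming")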

Spark DataFrame Filter using Binary (Array[Byte]) data

I have a DataFrame from a JDBC table hitting MySQL and I need to filter it using a UUID. The data is stored in MySQL using binary(16), and when queried in Spark it is converted to Array[Byte] as expected.
I'm new to Spark and have been trying various ways to pass a variable of type UUID into the DataFrame's filter method.
I've tried statements like:
val id: UUID = // other logic that looks this up
df.filter(s"id = $id")
df.filter("id = " convertToByteArray(id))
df.filter("id = " convertToHexString(id))
All of these fail with different error messages.
I just need to somehow pass in Binary types but can't seem to put my finger on how to do so properly.
Any help is greatly appreciated.
After reviewing even more sources online, I found a way to accomplish this without using the filter method.
When I'm reading from my sparkSession, I just use an ad hoc table instead of the table name, as follows:
sparkSession.read.jdbc(connectionString, s"(SELECT id, (other cols omitted) FROM MyTable WHERE id = 0x$id) AS MyTable", props)
This pre-filters the results for me and then I just work with the data frame as I need.
If anyone knows of a solution using filter, I'd still love to know it as that would be useful in some cases.
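One filter-based approach that might work (a sketch I haven't verified against this exact MySQL setup, and it assumes the column stores the plain big-endian UUID bytes rather than a driver-reordered layout): convert the UUID to its 16-byte representation and compare the column against a binary literal, since lit accepts an Array[Byte]:
import java.nio.ByteBuffer
import java.util.UUID
import org.apache.spark.sql.functions.{col, lit}

// Serialize the UUID into the 16 bytes a binary(16) column typically holds
def uuidToBytes(id: UUID): Array[Byte] =
  ByteBuffer.allocate(16)
    .putLong(id.getMostSignificantBits)
    .putLong(id.getLeastSignificantBits)
    .array()

val id: UUID = UUID.fromString("00112233-4455-6677-8899-aabbccddeeff") // example value
val filtered = df.filter(col("id") === lit(uuidToBytes(id)))
It is worth checking the result against a known row first, since how the bytes were originally written depends on the application that populated the column.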
