How to run long-running/eternal workflows in Azure Databricks

We want to listen to an Azure Event Hub and write the data to a Delta table in Azure Databricks. We have created the following:
df = spark.readStream.format("eventhubs").options(**ehConf).load()

# Code omitted where message content is expanded into columns in the dataframe

df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/tmp/delta/events/_checkpoints/") \
    .toTable("mydb.mytable")
This code works perfectly, and the notebook stays on the df.writeStream row until the job is canceled.
How should I set this up so it will run "eternally" and maybe even restart if the code crashes? Should I run it as a normal workflow and e.g. set it to run every minute but set Max concurrent runs to 1?

You don't need to schedule the job every minute - that would just fill your UI with failed runs. Instead, configure the job with max concurrent runs set to 1, no schedule, and Max Retries (and maybe a retry timeout) set to some big number, so that when the job fails, the workflow manager restarts it automatically.
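For illustration, a hedged sketch of those settings as Databricks Jobs API 2.1 fields (shown as a Python dict; the job name and notebook path are placeholders I made up):

job_settings = {
    "name": "eternal-eventhubs-stream",
    "max_concurrent_runs": 1,  # never run two copies of the stream at once
    "tasks": [
        {
            "task_key": "stream",
            "notebook_task": {"notebook_path": "/Repos/project/stream_notebook"},
            "max_retries": -1,  # -1 means retry indefinitely after a failure
            "min_retry_interval_millis": 60000,  # wait a minute between retries
            "retry_on_timeout": True,
        }
    ],
    # No "schedule" block: a single long-lived run keeps the stream alive.
}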
P.S. Instead of using the Spark connector for Event Hubs, consider using the built-in Kafka connector - it's more performant and stable. This answer is about DLT, but it really works for normal code as well.
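As a hedged sketch (EH_NAMESPACE, EH_NAME and EH_CONNECTION_STRING are placeholders; the kafkashaded JAAS class name is what Databricks clusters typically ship), reading the same Event Hub through its Kafka-compatible endpoint could look roughly like:

# Event Hubs exposes a Kafka endpoint on port 9093, authenticated via SASL PLAIN
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", f"{EH_NAMESPACE}.servicebus.windows.net:9093") \
    .option("subscribe", EH_NAME) \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.sasl.jaas.config",
            'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
            f'username="$ConnectionString" password="{EH_CONNECTION_STRING}";') \
    .load()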

Related

Spark Structured Stream Scalability and Duplicates Issue

I am using Spark Structured Streaming on a Databricks cluster to extract data from Azure Event Hub, process it, and write it to Snowflake using foreachBatch, with the epoch_id/batch_id passed to the foreach-batch function.
My code looks something like below:
ehConf = {}
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(EVENT_HUB_CONNECTION_STRING)
ehConf['eventhubs.consumerGroup'] = consumergroup

# Read stream data from event hub
spark_df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()
Some transformations...
Write to Snowflake
def foreach_batch_function(df, epoch_id):
    df.write \
        .format(SNOWFLAKE_SOURCE_NAME) \
        .options(**sfOptions) \
        .option("dbtable", snowflake_table) \
        .mode('append') \
        .save()

processed_df.writeStream \
    .outputMode('append') \
    .trigger(processingTime='10 seconds') \
    .option("checkpointLocation", "checkpoint/P1") \
    .foreachBatch(foreach_batch_function) \
    .start()
Currently I am facing 2 issues:
Duplicates after node failure. Although the official Spark docs state that when one uses foreachBatch along with the epoch_id/batch_id, recovery from node failure shouldn't produce any duplicates, I do find duplicates populated in my Snowflake tables. Link for reference: Spark Structured Streaming ForEachBatch With Epoch Id.
I very frequently encounter the errors a) TransportClient: Failed to send RPC 5782383376229127321 to /30.62.166.7:31116: java.nio.channels.ClosedChannelException and b) TaskSchedulerImpl: Lost executor 1560 on 30.62.166.7: worker decommissioned: Worker Decommissioned on my Databricks cluster. No matter how many executors I allocate or how much executor memory I add, the cluster reaches the max worker limit and I receive one of the two errors, with duplicates populated in my Snowflake table after recovery.
Any solution/ suggestion to any of the above points would be helpful.
Thanks in advance.
foreachBatch is by definition not idempotent: when the currently executing batch fails, it is retried, and partial results from the first attempt may already be visible - which matches your observations. Idempotent writes in foreachBatch are applicable only to Delta Lake tables, not to all sink types (in some cases, such as Cassandra, it could work as well). I'm not so familiar with Snowflake, but maybe you can implement something similar to what's done for other databases - write the data into a temporary table (each batch does an overwrite) and then merge from that temporary table into the target table.
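A hedged sketch of that staging-table pattern (STAGING_TABLE, TARGET_TABLE, the key column id, and sf_conn_params are hypothetical placeholders; the MERGE is issued here through the snowflake-connector-python client):

import snowflake.connector

def foreach_batch_function(df, epoch_id):
    # Overwrite the staging table, so a retried batch replaces partial results
    df.write \
        .format(SNOWFLAKE_SOURCE_NAME) \
        .options(**sfOptions) \
        .option("dbtable", STAGING_TABLE) \
        .mode('overwrite') \
        .save()
    # Merge staging into target; rows whose key already exists are skipped,
    # so a retried batch cannot introduce duplicates
    merge_sql = f"""
        MERGE INTO {TARGET_TABLE} t
        USING {STAGING_TABLE} s ON t.id = s.id
        WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload)
    """
    conn = snowflake.connector.connect(**sf_conn_params)
    try:
        conn.cursor().execute(merge_sql)
    finally:
        conn.close()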
Regarding the 2nd issue - it looks like you're using an autoscaling cluster; in that case workers can be decommissioned because the cluster manager detects that the cluster isn't fully loaded. To avoid that, you can disable autoscaling and use a fixed-size cluster.
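Illustratively (field names from the Databricks Clusters API; the worker count is an arbitrary placeholder), that amounts to replacing the autoscale range with a fixed num_workers in the cluster spec:

cluster_spec = {
    # "autoscale": {"min_workers": 2, "max_workers": 8},  # remove the range...
    "num_workers": 8,  # ...and pin the cluster to a fixed size instead
}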

How to understand spark execution time without access to spark history UI?

I have written a PySpark job, and it is running longer than expected. I want to analyze the job execution and fix the part of the code causing the slowness. Due to an access issue with the Spark history UI I cannot analyze the job plan, so I have to use some tricks around the code to understand in which section Spark is spending the most time.
I have tried running count on the dataframe, but that hasn't helped much in understanding the slowness.
Below are the steps in my code:
Step 1: read from a Cassandra table:
cassandra_data = spark_session.read \
    .format('org.apache.spark.sql.cassandra') \
    .options(table=table, keyspace=keyspace) \
    .load()
return cassandra_data
Step 2: add a column to the dataframe read from Cassandra that holds the md5 of the entire row.
data_wth_hash = prepare_data_md5(cassandra_data)
data_wth_hash.cache()
data_wth_hash.count()
Step 3: write into an AWS S3 folder.
The job takes much more time while writing into S3, and without access to the Spark history UI I cannot tell where it spends the most time.
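For instance, one trick around the code is to wall-clock each step by forcing an action after it; a rough sketch of what I mean (the labels and the s3_path variable are illustrative):

import time

def timed(label, fn):
    # Run one step, print how long it took, and hand back its result
    start = time.time()
    result = fn()
    print(f"{label}: {time.time() - start:.1f}s")
    return result

# Reads and transformations are lazy, so only an action (count/write)
# actually triggers - and therefore measures - the work of each step.
data_wth_hash = prepare_data_md5(cassandra_data).cache()
timed("read + md5 + cache", data_wth_hash.count)
timed("write to S3", lambda: data_wth_hash.write.mode("overwrite").parquet(s3_path))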

What happens when you restart a spark job if it encounters unexpected format in the data fed to kafka

I have a question regarding Spark Structured Streaming with Kafka.
Suppose that I am running a Spark job and everything is working perfectly.
One fine day, my Spark job fails because of inconsistencies in the data fed to Kafka. Inconsistencies may be anything like data format issues or junk characters that Spark couldn't process. In such a case, how do we fix the issue? Is there a way to get into the Kafka topic and change the data manually?
If we don't fix the data issue and restart the Spark job, it will read the same old row that caused the failure, since we have not yet committed the checkpoint. So how do we get out of this loop? How do we fix the data issue in the Kafka topic so the aborted Spark job can resume?
I would avoid trying to manually change one single message within a Kafka topic unless you really know what you are doing.
To prevent this from happening in the future, you might want to consider using a schema for your data (in combination with a schema registry).
For mitigating the problem you described I see the following options:
Manually change the offset of the Consumer Group of your structured streaming application
Create a "new" streaming job that starts reading from a particular offset
Manually change offset
When using Spark's Structured Streaming, the consumer group is set automatically by Spark. According to the code, the consumer group is defined as:
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
You can change the offset by using the kafka-consumer-groups tool. First identify the actual name of the consumer group by
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
and then set the offset for that consumer group for a particular topic (e.g. offset 100)
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --execute --reset-offsets --group spark-kafka-source-1337 --topic topic1 --to-offset 100
If you need to change the offset only for a particular partition you can have a look at the help function of the tool on how to do this.
Create new Streaming Job
You could make use of the Spark option startingOffsets as described in the Spark + Kafka integration guide:
Option: startingOffsets
value: "earliest", "latest" (streaming only), or json string """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """
default: "latest" for streaming, "earliest" for batch
meaning: The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
For this to work, it is important to have a "new" query. That means you need to delete the checkpoint files of your existing job or create a completely new application.
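As a hedged sketch, a new query that skips past a poison record at, say, offset 99 of partition 0 in topic1 might look like this (topic, offsets, and the fresh checkpoint path are illustrative):

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "topic1") \
    .option("startingOffsets", """{"topic1":{"0":100}}""") \
    .load()

# A brand-new checkpoint directory makes this a "new" query, so
# startingOffsets is honored instead of the old checkpoint state
df.writeStream \
    .option("checkpointLocation", "/tmp/checkpoints/new-query") \
    .format("console") \
    .start()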

Restarting a PySpark job doesn't get the records which were inserted into Kafka Topic while the pyspark consumer is down

I am running a PySpark job, and the data is streamed from Kafka.
I am trying to replicate a scenario on my Windows system to find out what happens when the consumer goes down while data is continuously being fed into Kafka.
Here is what I expect:
The producer is started and produces messages 1, 2 and 3.
The consumer is online and consumes messages 1, 2 and 3.
Now the consumer goes down for some reason while the producer produces messages 4, 5, 6 and so on...
When the consumer comes back up, my expectation is that it should read from where it left off. So the consumer must be able to read messages 4, 5, 6 and so on...
My PySpark application is not able to achieve what I expect. Here is how I set up the read stream:
session.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "clickapijson") \
    .option("startingOffsets", "latest") \
    .load()
I googled and gathered quite a bit of information. It seems the group.id is relevant here. Kafka keeps track of the offsets read by each consumer in a particular group.id. If a consumer subscribes to a topic with a group.id, say G1, Kafka registers this group and consumer ID and keeps track of them. If the consumer goes down for some reason and restarts with the same group.id, Kafka will have the information about the already-read offsets, so the consumer will read the data from where it left off.
This is exactly what happens when I use the following command to invoke the consumer job in the CLI.
kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic "clickapijson" --consumer-property group.id=test
Now when my producer produces messages 1, 2 and 3, the consumer is able to consume them. I killed the running consumer job (CLI .bat file) after the 3rd message was read. My producer produced messages 4, 5, 6 and so on...
Now I bring back my consumer job (CLI .bat file), and it is able to read the data from where it left off (from message 4). This behaves as I expect.
I am unable to do the same thing in PySpark.
When I include the option("group.id", "test"), it throws an error saying Kafka option group.id is not supported as user-specified consumer groups are not used to track offsets.
Observing the console output, each time my PySpark consumer job is kicked off it creates a new group.id. If my PySpark job previously ran with one group.id and failed, on restart it does not pick up the same old group.id; it randomly gets a new one. Kafka has the offset information of the previous group.id but not of the newly generated one. Hence my PySpark application is not able to read the data fed into Kafka while it was down.
If this is the case, won't I lose data when the consumer job goes down due to some failure?
How can I give my own group.id to the PySpark application, or how can I restart my PySpark application with the same old group.id?
In the current Spark version (2.4.5) it is not possible to provide your own group.id, as it gets created automatically by Spark (as you already observed). The full details on offset management when Spark reads from Kafka are given here and summarised below:
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
group.id: Kafka source will create a unique group id for each query automatically.
auto.offset.reset: Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off.
enable.auto.commit: Kafka source doesn’t commit any offset.
For Spark to be able to remember where it left off reading from Kafka, you need to have checkpointing enabled and provide a path location to store the checkpointing files. In Python this would look like:
aggDF \
.writeStream \
.outputMode("complete") \
.option("checkpointLocation", "path/to/HDFS/dir") \
.format("memory") \
.start()
More details on checkpointing are given in the Spark docs on Recovering from Failures with Checkpointing.

Alternate to recursively Running Spark-submit jobs

Below is the scenario I need suggestions on.
Scenario:
Data ingestion is done through Nifi into Hive tables.
Spark program would have to perform ETL operations and complex joins on the data in Hive.
Since the data ingested from Nifi is a continuous stream, I would like the Spark jobs to run every 1 or 2 minutes on the ingested data.
Which is the best option to use?
Trigger spark-submit jobs every minute using a scheduler?
How do we reduce the overhead and time lag of repeatedly submitting the job to the Spark cluster? Is there a better way to run a single program repeatedly?
Run a spark streaming job?
Can a spark-streaming job be triggered automatically every minute to process the data from Hive? [Can Spark Streaming be triggered only on a time basis?]
Is there any other efficient mechanism to handle such a scenario?
Thanks in Advance
If you need something that runs every minute, you'd better use Spark Streaming rather than batch.
You may want to get the data directly from Kafka and not from the Hive table, since that is faster.
As for your question of which is better, batch or streaming: you can think of Spark Streaming as a micro-batch process that runs every "batch interval".
Read this: https://spark.apache.org/docs/latest/streaming-programming-guide.html
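For illustration, a single long-lived Structured Streaming query that fires a micro-batch every minute (the broker, topic, paths, and output format are placeholders) would look roughly like:

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "ingest_topic") \
    .load()

# One always-running job; the trigger replaces the external 1-minute scheduler
df.writeStream \
    .trigger(processingTime="1 minute") \
    .option("checkpointLocation", "/tmp/checkpoints/etl") \
    .format("parquet") \
    .option("path", "/warehouse/etl_output") \
    .start()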
