How can Spark Streaming rerun a failed batch job? - apache-spark

My question is this:
I use Spark Streaming to read data from Kafka with the direct stream API, process each RDD, and then update the ZooKeeper offsets manually.
The data read from Kafka is inserted into a Hive table.
Now I have hit a problem.
Sometimes the Hive metastore process exits for some reason (the metastore is currently a single instance).
Some batch jobs fail because of that, but the Spark Streaming job does not exit; it just logs some warnings.
When I restart the Hive metastore process, the application carries on and the new batch jobs succeed.
But I find that the data the failed batches read from Kafka is missing.
I can see this from the metadata in the job details.
Imagine each batch job reads 20 offsets from Kafka:
batch 1 reads offsets 1-20,
batch 2 reads offsets 21-40.
If batch 1 fails and batch 2 succeeds, the data of the failed batch 1 is lost.
How can I handle this?
How can I rerun the failed batch job?
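For reference, a minimal sketch of the read-process-commit loop described above. It assumes the spark-streaming-kafka-0-10 API and commits offsets back to Kafka with commitAsync rather than writing them to ZooKeeper as in the question, but the principle is the same: only advance the offsets after the Hive write succeeds, so a failed batch is re-read from the same offset range after a restart. The servers, topic, group id, and writeToHive sink are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val conf = new SparkConf().setAppName("kafka-to-hive")
    val ssc = new StreamingContext(conf, Seconds(60))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka:9092",                 // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "hive-loader",                         // placeholder
      "enable.auto.commit" -> (false: java.lang.Boolean)   // commit manually
    )

    // Stand-in for the real "insert into the Hive table" step.
    def writeToHive(values: org.apache.spark.rdd.RDD[String]): Unit = {}

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      try {
        writeToHive(rdd.map(_.value()))
        // Commit only after the write succeeds; if the metastore is down and
        // the write throws, the offsets stay put and this batch's range can
        // be replayed after a restart.
        stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      } catch {
        case e: Exception =>
          // Offsets are intentionally NOT committed here.
          org.apache.log4j.Logger.getRootLogger.warn("batch failed, offsets kept", e)
      }
    }

    ssc.start()
    ssc.awaitTermination()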

Related

Gracefully exit without running pending batches in Spark DStreams

I am trying Spark DStreams with Kafka.
I am not able to commit offsets for the pending/lagging batches consumed by the Spark DStream. The issue happens after the streaming context is stopped using ssc.stop(true, true): the streaming context is stopped, but the Spark context is still running the pending batches.
Here are a few things I have done:
create a DStream to get data from a Kafka topic (successfully);
commit offsets manually back to Kafka using stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges);
batch interval is 60 seconds;
batch time (the time taken to process the incoming data) is 2 minutes;
streamingContext.stop(true, true).
Please tell me if there is a way to commit the offsets for the pending batches as well, or to exit gracefully after the currently running batch and discard the pending batches, so that the offsets for the pending batches are not committed.
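As a sketch of the knobs involved (whether commitAsync actually fires for the final batch is exactly what the question is asking about, so this is not a verified fix): a graceful stop waits for data that has already been received to finish processing, while a hard stop shuts down quickly and the queued batches, and hence their commits, are dropped:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("shutdown-demo")
      // Let Spark trigger a graceful stop itself when the JVM gets SIGTERM,
      // instead of killing batches mid-flight.
      .set("spark.streaming.stopGracefullyOnShutdown", "true")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Explicit equivalent:
    //   stopGracefully = true  -> finish processing the data already
    //                             received (the pending batches), then stop
    //   stopGracefully = false -> stop quickly; queued batches are discarded
    //                             and their offsets are never committed
    ssc.stop(stopSparkContext = true, stopGracefully = true)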

How to figure out Kafka startingOffsets and endingOffsets in a scheduled Spark batch job?

I am trying to read from a Kafka topic in my Spark batch job and publish to another topic. I am not using streaming because it does not fit our use case. According to the Spark docs, the batch job starts reading from the earliest Kafka offsets by default, so when I run the job again, it reads from the earliest again. How do I make sure that the job picks up the next offset from where it last read?
According to the Spark Kafka integration docs, there are options to specify "startingOffsets" and "endingOffsets". But how do I figure them out?
I am using the spark.read.format("kafka") API to read the data from Kafka as a Dataset, but I did not find any option to get the start and end offset range from this read.
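For reference, a sketch of what those options look like (the servers, topic name, and offset numbers are made up): both options take either a keyword or a JSON string mapping topic -> partition -> offset, and the job itself has to persist where it stopped, e.g. by taking max(offset) per partition from the result and storing it for the next run:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.max

    val spark = SparkSession.builder.appName("kafka-batch").getOrCreate()

    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")   // placeholder
      .option("subscribe", "input-topic")                // placeholder
      // JSON form: topic -> partition -> offset (-2 = earliest, -1 = latest)
      .option("startingOffsets", """{"input-topic":{"0":42,"1":100}}""")
      .option("endingOffsets", "latest")
      .load()

    // The Kafka source exposes each record's partition and offset as columns,
    // so the job can compute where it stopped and save that for the next run.
    val lastRead = df.groupBy("partition").agg(max("offset"))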

Alternative to recursively running spark-submit jobs

Below is the scenario I need suggestions on.
Scenario:
Data ingestion is done through NiFi into Hive tables.
A Spark program has to perform ETL operations and complex joins on the data in Hive.
Since the data ingested from NiFi is a continuous stream, I would like the Spark jobs to run every 1 or 2 minutes on the ingested data.
Which is the best option to use?
Trigger spark-submit jobs every minute using a scheduler? How do we reduce the overhead and the time lag of repeatedly submitting the job to the Spark cluster? Is there a better way to run a single program repeatedly?
Run a Spark Streaming job? Can a Spark Streaming job be triggered automatically every minute to process the data from Hive? (Can Spark Streaming only be triggered time-based?)
Is there any other efficient mechanism to handle such a scenario?
Thanks in advance.
If you need something that runs every minute, you are better off with Spark Streaming than with batch jobs.
You may also want to get the data directly from Kafka rather than from the Hive table, since it is faster.
As for which is better, batch or streaming: you can think of Spark Streaming as a micro-batch process that runs every "batch interval".
Read this: https://spark.apache.org/docs/latest/streaming-programming-guide.html
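A minimal sketch of that micro-batch model (the 60-second interval mirrors the "every 1 min" requirement; the pipeline body is left out):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("every-minute-etl")
    // The batch interval is the "schedule": a new micro-batch is planned
    // every 60 seconds, with no repeated spark-submit overhead.
    val ssc = new StreamingContext(conf, Seconds(60))
    // ... define the DStream sources and ETL transformations here ...
    ssc.start()
    ssc.awaitTermination()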

How to write functional Test with spark

I have a Spark batch job that talks to Cassandra. After the batch job completes, I need to verify a few entries in Cassandra, and the cycle repeats 2-3 times. How do I know when the batch job ends? I don't want to track the status of the batch job by adding an entry in the database.
How do I write a functional test in Spark?
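One common shape for such a test (a sketch; ScalaTest and the local master are assumptions, and the inline transformation stands in for the real job): run the job synchronously inside the test, so "the job has ended" is simply the call returning, then assert on the output:

    import org.apache.spark.sql.SparkSession
    import org.scalatest.funsuite.AnyFunSuite

    class EtlJobSpec extends AnyFunSuite {
      test("job produces the expected rows") {
        val spark = SparkSession.builder
          .master("local[2]")   // run Spark inside the test JVM
          .appName("etl-test")
          .getOrCreate()
        import spark.implicits._
        try {
          // Stand-in for the real job; calling it synchronously means the
          // test knows exactly when the batch has finished.
          val out = Seq(1, 2, 3).toDS().map(_ * 2).collect().toSet
          assert(out == Set(2, 4, 6))
        } finally {
          spark.stop()
        }
      }
    }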

Spark running for 10 hours even though Kafka shows 0 message lag

I am running a Spark Streaming job that consumes messages from Kafka. I have also defined a checkpoint directory in my Spark code.
We did a bulk message upload to Kafka yesterday. When I check the offset status in Kafka using
bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker \
  --group xxx-streaming-consumer-group --zookeeper xxx.xxx.xxx.xxx:2181
it shows there is no message lag. However, my Spark job has now been running for the last 10 hours.
My understanding is that the Spark Streaming code should read the messages sequentially and update the offsets in Kafka accordingly.
I cannot figure out why Spark is still running even though there is no message lag in Kafka. Can someone explain?
